THE  DESIGN  OF  TREE  AND  TRELLIS 
DATA  COMPRESSION  SYSTEMS* 

Joseph  Linde  and  Robert  M.  Gray 

Abstract 

Recent results in information theory promise the existence of tree
and trellis data compression systems operating near the Shannon theoretical
bound, but provide little indication of how actually to design such systems.
We here present several intuitive design approaches and also a general
design philosophy based upon the generation of "fake processes," i.e.,
finite entropy processes which are close (in the generalized Ornstein
distance) to the process one wishes to compress. Most of the design
procedures can be used for a wide class of sources. Performance is
evaluated, via simulations, for memoryless, autoregressive, and moving
average Gaussian sources and compared to that of traditional data
compression systems. The new schemes typically provide 1-2 dB improvement
in performance over the traditional schemes at a rate of 1 bit/symbol. The
inevitable increase in complexity is moderate in most cases.


*This research was supported by Air Force contract F44620-73-0065 and
by the Joint Services Electronics Program at Stanford under U.S. Navy
Grant N00013-67-A-0112.

1. Introduction

A discrete time data compression or source coding system can be
viewed as a means of first encoding a sequence of continuous-alphabet
data $\{X_n\}$ into a sequence of binary symbols, say $\{Y_n\}$, and then
decoding the binary symbols into a reproduction $\{\hat{X}_n\}$. The system is
said to be a fixed rate system if each time a fixed number of source
symbols are input to the encoder and a fixed number of binary encoded
symbols are output. The rate of the system $R$ (in bits/symbol) is
the ratio of the number of encoded binary digits to the number of
source symbols in a long period of time. The smaller the rate, the
larger the "compression" since fewer binary symbols must then be
transmitted in a given time and hence the required bandwidth is smaller.

In  most  communication  systems  the  source  has  a continuous  alphabet 
(usually  the  real  line)  and  an  infinite  rate  is  therefore  required  to 
communicate the source perfectly. Usually, however, one has a finite
rate constraint due perhaps to a digital communication link, a finite
capacity channel or a digital storage device. Even where no such finite
rate constraint exists, it may be convenient to impose one to facilitate
encryption for data security.

Regardless of the motivation, since $R = \infty$ is in general required
for perfect reproduction, but a finite rate is used instead, distortion
between the original source and delivered reproduction must inevitably
result. The evaluation of lower bounds on the average distortion is
the main concern of rate-distortion theory [1, 2]. The goal of data
compression is to design systems which operate close to these unbeatable
bounds for a given rate constraint. Equivalently, one may be given a
distortion or fidelity constraint and attempt to minimize the rate needed
to meet this requirement (lower rate means lower cost and complexity in
later communication and storage).

The previous discussion also holds for continuous-time information
sources, since a discrete-time source can always be obtained (possibly
at the cost of additional distortion) via sampling or a transformation
such as the Karhunen-Loeve expansion. Hence we confine the discussion to
discrete time systems.

Data  compression  has  developed  along  two  largely  separate  but 
occasionally  connected  paths  [3]. 

(1) Several ad hoc but often quite good practical data compression
systems have been developed. In particular PCM, DPCM, predictive
quantization and delta modulation have become the workhorses of fixed
rate data compression.

(2) Theoretical bounds on the optimal achievable performance
(for a given rate) have been developed in information theory and it
has been shown that certain classes of coding structures (such as
block codes) can achieve nearly optimal performance. Unfortunately,
information theory rarely tells one how to find (or design) a good code.

The principal original application of information theory to data
compression system design was the demonstration [4,5] that for very high
rate systems and memoryless sources, simple PCM performed within 0.25 bit
of the theoretical limit, indicating that little was to be gained by
more sophisticated (and expensive) systems in such a situation. Memoryless
sources are of less interest than sources with memory since the
potential gains of data compression are usually greater for the latter, as
they have more "redundancy" to remove. Simulations indicate that high
rate predictive quantization often yields good performance for sources
with memory, but there is no rigorous counterpart to the results of
[4,5] (despite some prevalent myths). Proving that such systems in
general (or even in particular cases) yield nearly optimal performance
is one of the important open problems in information theory. In
addition, little is known about what schemes work well when a low rate
such as 1 bit/symbol is required (wherein PCM may be significantly
suboptimal even for a memoryless source).

Recently, results have been proved in information theory showing
that nearly optimal performance can be achieved by a class of data
compression systems called tree or trellis encoding systems [6, 7, 8, 9,
10]. These systems consist of a (possibly nonlinear) digital filter as a
decoder and a matched tree or trellis search as the encoder. Furthermore,
it has been shown that the decoding filter can be assumed to be
time-invariant [10], in which case the system is referred to as a sliding
block code [11, 12]. Such coding structures are of potential practical
importance since similar structures have been found to yield nearly
optimal performance with moderate complexity in the dual problem of
channel coding for error correction.

Viterbi algorithms, which are used in the trellis code case (when the
decoder is a finite state device), are now fairly cheap to implement and
the decoder is simply a digital filter.

We are considering systems that have the structure described in
Fig. 1. The decoder consists of a shift register (finite or infinite)
and a (usually nonlinear) function $F: A^K \to \mathbb{R}$, where $A$ is the
alphabet of the decoder input sequence, $K$ the decoder shift register
length and $\mathbb{R}$ the real line. The encoder is a search algorithm
matched to the decoder: a tree search or a trellis search.


As  an  example,  a rate  1 system  has  a decoder  that  at  each  time 
unit  accepts  a binary  symbol  and  outputs  a reproduction  symbol.  Since 
the  current  output  depends  only  on  the  current  binary  input  and  the  previous 
contents  of  the  shift  register  (the  state  of  the  decoder)  the  decoder 
behavior  can  be  described  by  the  tree  diagram  of  Fig.  2 (assuming  an 
initial all zero state). Each tree node represents a state; one
advances along the upper path if a "1" is input and along the lower path
if a "0" is input, and the decoder outputs the "label" on the branch taken.

If the decoding register has a finite size $K$ (there is a finite
memory and no feedback) then the tree has a simplified picture obtained
by realizing that there are only $2^{K-1}$ distinct states and hence many
of the tree nodes can be "merged." The resulting structure, called
a (time-invariant) trellis, is depicted in Figure 3 for the case
$K = 3$.
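
The decoder structure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used for the results of this paper: the class name `ShiftRegisterDecoder`, the helper `trellis_states`, and the label function `example_F` (a weighted sum of +/-1 symbols) are hypothetical and chosen purely to make the shift-register mechanics and the count of $2^{K-1}$ states concrete.

```python
from itertools import product


class ShiftRegisterDecoder:
    """Rate-1 sliding-block decoder: a K-stage binary shift register plus a label map F."""

    def __init__(self, K, F):
        self.K = K
        self.F = F                       # F maps a K-tuple of bits to a real number
        self.state = (0,) * (K - 1)      # the previous K-1 bits; all-zero initial state

    def step(self, bit):
        contents = (bit,) + self.state   # newest bit first
        self.state = contents[:-1]       # shift: the oldest bit drops out
        return self.F(contents)          # branch label = reproduction symbol


def trellis_states(K):
    """All distinct decoder states (the contents of the K-1 oldest stages)."""
    return list(product((0, 1), repeat=K - 1))


def example_F(contents):
    # Hypothetical label map, for illustration only.
    weights = (1.0, 0.5, 0.25)
    return sum(w * (2 * b - 1) for w, b in zip(weights, contents))


decoder = ShiftRegisterDecoder(K=3, F=example_F)
outputs = [decoder.step(b) for b in (1, 0, 1, 1)]   # labels along one path
```

For $K = 3$ the helper returns four states, which is the number of nodes per stage in the trellis of Figure 3.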

Since  the  decoder  uniquely  determines  the  trellis  or  tree,  the 
whole  problem  of  system  design  becomes  the  problem  of  choosing  the  best, 
or  at  least  a good,  decoder.  Once  a decoder  is  given,  the  encoder  is 
just  a search  algorithm  that  searches  the  trellis  or  the  tree  and 
finds  a good  path. 

In  practical  systems  the  delay  must  be  finite  and  hence  the  search 
algorithm  will  usually  fall  within  one  of  the  two  categories: 

(1) Block Search - The encoder inputs a block of source symbols and
then searches for a binary sequence that will yield the best
reproduction when driving the decoder. When the best path is found,
the encoder outputs the whole string and proceeds to the next
non-overlapping block. After each block the encoder may drive
the decoder back to the all zero state or may continue from
the last state of the previous block. If the decoder is
finite and the search algorithm is the Viterbi algorithm, this
system is referred to as a Block Trellis Encoding System.

(2) Incremental Search - The encoder inputs a block of source
symbols and searches for the best path; only the first step
along this path is output. The next input symbol is then read
in, the best path is found for the second, overlapping block,
and its first step is transmitted, and so on. In executing
the search, information from previous searches can be used to
reduce the computation time significantly. This system is an
example of a true sliding block code.

For  a trellis  search  the  Viterbi  algorithm  [13,  14]  is  used  almost 
universally.  For  a tree  search  one  can  use  a simple  exhaustive  search 
(if the block is very short) or the Stack, Fano or (M,L) algorithm
[15, 1 (Chapter 6)].
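
As a concrete illustration of the block trellis search, the following Python sketch runs a Viterbi search over the trellis of a rate-1 decoder with a $K$-stage register and squared-error distortion. It is a minimal sketch under assumptions, not the program used for the simulations reported later: the function names, the integer encoding of the register, and the label map `example_F` are all hypothetical.

```python
from itertools import product  # used only for brute-force checking of small cases


def bits_of(contents, K):
    """Unpack an integer register into a tuple of K bits, newest first."""
    return tuple((contents >> (K - 1 - j)) & 1 for j in range(K))


def decode(bits, K, F):
    """Run the rate-1 decoder from the all-zero state and return its outputs."""
    state, out = 0, []
    for bit in bits:
        contents = (bit << (K - 1)) | state    # newest bit in the top position
        out.append(F(bits_of(contents, K)))
        state = contents >> 1                  # shift out the oldest bit
    return out


def viterbi_encode(source, K, F):
    """Return (bits, total squared error) of the best path through the trellis."""
    n_states = 1 << (K - 1)
    inf = float("inf")
    cost = [inf] * n_states
    cost[0] = 0.0                              # start in the all-zero state
    history = []                               # back-pointers, one list per step
    for x in source:
        new_cost = [inf] * n_states
        back = [None] * n_states
        for state in range(n_states):
            if cost[state] == inf:
                continue                       # state not yet reachable
            for bit in (0, 1):
                contents = (bit << (K - 1)) | state
                c = cost[state] + (x - F(bits_of(contents, K))) ** 2
                nxt = contents >> 1
                if c < new_cost[nxt]:          # keep only the survivor path
                    new_cost[nxt] = c
                    back[nxt] = (state, bit)
        cost = new_cost
        history.append(back)
    state = min(range(n_states), key=lambda s: cost[s])
    total = cost[state]
    bits = []
    for back in reversed(history):             # trace the survivor backwards
        state, bit = back[state]
        bits.append(bit)
    bits.reverse()
    return bits, total


def example_F(contents):
    # Hypothetical label map, for illustration only.
    return sum(w * (2 * b - 1) for w, b in zip((1.0, 0.5, 0.25), contents))
```

Because only one survivor is kept per state, the work per source symbol is proportional to the number of states rather than to the number of paths, which is what makes the trellis search practical for moderate $K$.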

As  mentioned  before,  the  design  problem  of  a tree  or  trellis  coding 
system  is  the  problem  of  designing  the  nonlinear  filter  since  this  in 
turn  specifies  the  tree  or  trellis  search  by  defining  the  tree  or  trellis. 
The  design  problem  is  sometimes  described  as  that  of  finding  a rule  to 
"color"  or  "label"  the  tree. 

It was recently proved that tree and trellis codes operating in the
block search fashion can achieve performance arbitrarily close to the
theoretical rate-distortion bound [10]. Although it is believed that
incremental tree or trellis search systems can also achieve such
performance, the proof is still an open problem.

As  is  often  the  case  in  information  theory,  the  coding  theorems 
only prove the existence of good codes having a certain structure; they
do not say how to actually design such a good code. The few attempts
to  study  the  actual  performance  of  specific  tree  and  trellis  encoding 
systems  have  fallen  into  the  following  three  categories. 

(1) Random Coding - The traditional proofs of the tree source coding
theorems involve random coding arguments, that is, selecting
the branch labels at random according to a probability
distribution arising in the evaluation of the optimal achievable
performance (distortion-rate or rate-distortion function) [6, 7,
8, 9]. This led to the design of tree codes by such random
"coloring" of the trellis [16]. Unfortunately, however, this
method results in a time-varying decoding filter and hence
more complicated decoding circuitry and a vastly more
complicated tree with large storage requirements.

(2) Adaptive Data-Filtering - Mark [17] developed a technique for
designing a decoder by adaptive Kalman-type filtering of the
source data to produce a source model with which to color the
trellis. This technique outperforms predictive quantization,
but it also requires a time-varying decoding filter and the
transmission of extra "side information" to tell the decoder
how the trellis is colored.

(3) Plagiarized Decoder - Several authors [18, 19, 20] suggested a
very intuitive technique whereby tree encoding is used to
improve traditional techniques: simply take the usual decoder
of a data compression system (e.g., predictive quantization
or delta modulation) but use a tree search algorithm as the
encoder. Since an exhaustive tree search will find the best
path through the tree while the predictive quantizer finds merely
some path through the tree, the performance of the new system
cannot be worse than that of the traditional system and
actually a considerable gain is often realized. For a practical
tree search which is not exhaustive (usually because of
computation and storage requirements) the gain is somewhat
smaller.

In  this  paper  several  techniques  for  tree  and  trellis  encoding 
are  studied  and  their  performance  is  compared  with  that  of  the  traditional 
techniques  and  the  theoretical  bounds.  We  focus  on  time-invariant 
decoders  and  consider  one  bit  per  symbol  data  compression  systems  both 
for  simplicity  and  since  this  is  a less  well-understood  case.  All  the 
results  are  readily  extended  to  higher  rates.  To  allow  comparison  with 
the  traditional  techniques  of  PCM,  DPCM,  DM  and  predictive  quantization, 
we consider the common Gaussian source models: memoryless, autoregressive
and moving average processes. Most of the techniques presented are
extendable to a much larger class of sources. The memoryless case may at
first seem to be beating a dead horse but it is useful both for comparison
with PCM in the low bit rate case and as a step in designing trellis
codes for sources that can be modelled as a filtered memoryless process.


It  is  worth  noting  that  simple  one  bit  tree  or  trellis  encoding  systems 
can provide about 0.7 dB improvement in SNR over the optimal one bit
quantizer  for  a memoryless  Gaussian  source.  To  the  best  of  our  knowledge 
such  improvement  has  not  previously  been  realized  without  a random  coding 
approach  (and  hence  time-varying  systems)  in  this  case.  Also,  this 
apparently  low  improvement  is  significant  since  the  total  improvement 
cannot  be  greater  than  1.4  dB. 

Several interesting properties of these systems and open problems
in their analysis are described. We believe that their simplicity and
superiority over traditional techniques make them a promising class of
systems.

Most of the forthcoming results were obtained by simulations
using an INTERDATA minicomputer running under DOS and a PDP-11
minicomputer running under the UNIX system.

2. Preliminaries


For simplicity and to facilitate comparison with existing systems
we here consider only first-order linear Gaussian sources, that is,
sources modelled as white Gaussian sources filtered by first-order
linear filters. Let $\{Z_n\}$ be a sequence of independent, identically
distributed (i.i.d., memoryless, white) Gaussian zero mean random
variables with variance $\sigma_Z^2 = E Z_n^2$. A first order Gaussian
autoregressive source $\{X_n\}$ (or a Gauss Markov source) is defined by
the difference equation

$$X_n = Z_n + a X_{n-1}, \tag{2.1}$$

where $a$ is called the autoregressive constant. In other words, $X_n$
is the output of the first-order autoregressive filter of Fig. 4 driven
by $Z_n$. If $|a| < 1$, then the filter is stable, $\{X_n\}$ is
stationary, (2.1) makes sense for all integers $n$, and we can write

$$X_n = \sum_{k=0}^{\infty} a^k Z_{n-k} \tag{2.2}$$

(where $a^0 = 1$). If $|a| = 1$ ($\{X_n\}$ is the discrete-time
Wiener process), then the filter is unstable and (2.1) must be
"initialized" to make sense, e.g., set $X_0 = 0$ and let (2.1) hold for
$n \ge 1$. The resulting process is nonstationary. The process $\{X_n\}$ is
easily seen to have zero mean and, if $|a| < 1$, to have variance

$$\sigma_X^2 = \sigma_Z^2 \sum_{k=0}^{\infty} a^{2k} = \sigma_Z^2/(1-a^2),$$

correlation coefficient

$$r = E X_n X_{n-1}/\sigma_X^2 = E X_{n-1}(Z_n + a X_{n-1})/\sigma_X^2 = a E X_{n-1}^2/\sigma_X^2 = a,$$

and autocorrelation $R_X(\tau) = \sigma_X^2 a^{|\tau|}$. A first-order
Gaussian moving average source is defined by the difference equation

$$X_n = Z_n + a Z_{n-1} \tag{2.3}$$

and is obtained by passing $\{Z_n\}$ through the first-order moving average
filter of Fig. 5. The filter is stable, and the resulting process $\{X_n\}$ is
stationary and has autocorrelation

$$R_X(k) = E X_n X_{n+k} = \sigma_Z^2 \{ (1+a^2)\delta_{0,k} + a\delta_{-1,k} + a\delta_{1,k} \},$$

where $\delta_{m,k} = 1$ if $m = k$ and 0 if $m \ne k$.
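
The second-order properties quoted above are easy to check numerically. The following Python sketch generates the autoregressive source (2.1) and the moving average source (2.3) and estimates their autocorrelations; it is an illustration only, with hypothetical function names, and it assumes a stationary start for $|a| < 1$ and the initialization $X_0 = 0$ otherwise.

```python
import math
import random


def gauss_markov(n, a, sigma_z=1.0, seed=0):
    """Samples of X_n = Z_n + a X_{n-1}, with Z_n i.i.d. N(0, sigma_z^2)."""
    rng = random.Random(seed)
    if abs(a) < 1.0:
        # Draw the starting value from the stationary distribution.
        x = rng.gauss(0.0, sigma_z / math.sqrt(1.0 - a * a))
    else:
        x = 0.0                                  # "initialized" unstable case
    out = []
    for _ in range(n):
        x = rng.gauss(0.0, sigma_z) + a * x
        out.append(x)
    return out


def moving_average(n, a, sigma_z=1.0, seed=0):
    """Samples of X_n = Z_n + a Z_{n-1}."""
    rng = random.Random(seed)
    prev = rng.gauss(0.0, sigma_z)
    out = []
    for _ in range(n):
        z = rng.gauss(0.0, sigma_z)
        out.append(z + a * prev)
        prev = z
    return out


def sample_autocorr(x, lag):
    """Time-average estimate of E X_n X_{n+lag} for a zero mean sequence."""
    n = len(x) - lag
    return sum(x[i] * x[i + lag] for i in range(n)) / n


x = gauss_markov(100000, 0.8)
variance = sample_autocorr(x, 0)                 # compare with sigma_z^2 / (1 - a^2)
corr_coeff = sample_autocorr(x, 1) / variance    # compare with a
```

For long sequences the estimates should approach $\sigma_Z^2/(1-a^2)$ and $a$ for the autoregressive source, and $(1+a^2)\sigma_Z^2$, $a\sigma_Z^2$ and 0 at lags 0, 1 and 2 for the moving average source.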

The quantization techniques of simple quantization (PCM), predictive
quantization, delta modulation, and differential pulse code modulation
(DPCM) can all be considered as special cases of the predictive
quantization scheme of Fig. 6. The binary coder is a fixed rate one-to-one
operation mapping each quantized error term $\hat{e}_n$ into a binary
$k$-tuple, yielding a data compressor of rate $k$ bits per symbol. A PCM
system has the trivial predictor $P_n = 0$. A DPCM system sets
$P_n = \hat{X}_{n-1}$ and a delta modulator is simply a one-bit DPCM system
and hence both systems simply quantize and transmit the error between the
current sample $X_n$ and the reconstructed previous sample $\hat{X}_{n-1}$.
A linear predictive quantizer sets
$P_n = a_1 \hat{X}_{n-1} + a_2 \hat{X}_{n-2} + \ldots$, a linear combination
of the previous reconstructed samples. In general, one would like to choose an
"optimum" predictor, but the complexity of the feedback loop precludes
an exact determination of such a predictor. As a result, most designers
use as the predictor $P_n$ the optimum linear predictor of $X_n$ given
$X_{n-1}, X_{n-2}, \ldots$, the actual previous samples rather than their
reconstructed values. The argument is that if the quantizer is "good,"
$\hat{X}_{n-1}$ will be nearly $X_{n-1}$ and the optimal linear predictor
for $X_{n-1}$ should also work for $\hat{X}_{n-1}$. For example, if
$\{X_n\}$ is a zero mean stationary first order Markov source, then the
optimal estimate $P_n = f(X_{n-1}, X_{n-2}, \ldots)$ for $X_n$ in the sense
of minimizing the expected squared error $E(X_n - P_n)^2$ is well-known to be
$f(X_{n-1}, X_{n-2}, \ldots) = r X_{n-1}$, where
$r = E X_n X_{n-1}/E X_n^2 = R_X(1)/\sigma_X^2$, the correlation coefficient
of the source.

We shall therefore usually confine interest to a linear predictive
quantizer of the form of Fig. 7. Note that for such a system the
decoder is a memoryless nonlinearity (mapping binary symbols into a finite
subset of real numbers) followed by a linear filter (with feedback)
satisfying the relation

$$\hat{X}_n = \hat{e}_n + a \hat{X}_{n-1} = \sum_{k=0}^{\infty} a^k \hat{e}_{n-k}, \tag{2.4}$$

where $\hat{e}_n$ is the quantized error term and where $a = 0$
($0^0 = 1$) for PCM, $a = 1$ for DPCM and delta modulation,
and $a = r$ for an optimum linear predictive quantizer for a stationary
zero mean first-order Markov source. If more complicated predictors are
used, the filter defined by (2.4) becomes more complicated. Note that
even in the simple case of linear predictive quantization, the decoder
has the form $\hat{X}_n = g(Y_n, Y_{n-1}, Y_{n-2}, \ldots)$, that is, the
decoder has an infinite constraint length.
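
The feedback structure of the linear predictive quantizer and its decoder can be made concrete with a short Python sketch of the one-bit case. This is an illustrative sketch under assumptions rather than a description of Fig. 7 itself: the two-level quantizer with output levels `+delta` and `-delta`, the all-zero initial condition, and the function names are hypothetical choices introduced here.

```python
def predictive_quantizer(source, a, delta):
    """One-bit linear predictive quantizer; returns (bits, reproduction)."""
    bits, recon = [], []
    x_hat_prev = 0.0                            # assumed initial decoder state
    for x in source:
        p = a * x_hat_prev                      # prediction from the reconstructed sample
        err = x - p                             # prediction error formed in the loop
        bit = 1 if err >= 0 else 0              # one-bit quantizer decision
        e_hat = delta if bit else -delta        # quantized error term
        x_hat = e_hat + p                       # decoder recursion, as in (2.4)
        bits.append(bit)
        recon.append(x_hat)
        x_hat_prev = x_hat
    return bits, recon


def decode(bits, a, delta):
    """Decoder alone: memoryless map bit -> +/-delta followed by the feedback filter."""
    x_hat, out = 0.0, []
    for bit in bits:
        x_hat = (delta if bit else -delta) + a * x_hat
        out.append(x_hat)
    return out


bits, recon = predictive_quantizer([0.5, 1.2, -0.3, -1.0, 0.1], a=0.8, delta=0.6)
same = decode(bits, a=0.8, delta=0.6) == recon  # encoder and decoder track each other
```

Setting `a = 0` reduces the loop to one-bit PCM and `a = 1` to delta modulation, matching the special cases listed above. Each output depends on every past bit through the feedback term, which is the infinite constraint length noted in the text.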

For the case of a Gauss Markov source, it is well-known that the
optimal linear predictor yields an error sequence
$e_n = X_n - r X_{n-1} = Z_n$ that is i.i.d., that is, "whitening" $X_n$ by
removing the optimal linear prediction yields an error process $\{e_n\}$
(called the innovations of $\{X_n\}$) that is simply the original white
process $\{Z_n\}$ used to model $\{X_n\}$. As previously mentioned, it is a
standard approximation for high rate predictive quantization systems for
Gauss Markov sources to assume $\tilde{e}_n = e_n$, where
$\tilde{e}_n = X_n - P_n$ is the prediction error actually formed in the
loop; that is, the quantizer is so good that $\hat{X}_{n-1} = X_{n-1}$ and
hence
$\tilde{e}_n = X_n - P_n = X_n - r \hat{X}_{n-1} = X_n - r X_{n-1} = e_n$.
This assumption is equivalent to replacing the predictive quantizer of
Fig. 7 by the innovations quantizer of Fig. 8 for purposes of analysis.
An approximate analysis of predictive quantization is then obtained by
using asymptotic quantizer relations [4] on $e_n$ to obtain
$E(e_n - q(e_n))^2$, where $q$ denotes the quantizer, and using the fact
$e_n = \tilde{e}_n$ to state

$$E(e_n - q(e_n))^2 = E(X_n - \hat{X}_n)^2 = E(\tilde{e}_n - q(\tilde{e}_n))^2, \tag{2.5}$$

where the second equality follows from the fact that

$$E(X_n - \hat{X}_n)^2 = E(X_n - (r \hat{X}_{n-1} + q(\tilde{e}_n)))^2 = E(\tilde{e}_n - q(\tilde{e}_n))^2. \tag{2.6}$$

The complication of the nonlinear feedback loop in the system has thus
far precluded a rigorous proof of this approximation, but it has been
found to be good for high rate systems. In addition, the innovations
quantizer has been thought to be itself a reasonable system for high
rate data compression since it is known that for Gauss Markov sources
the optimal performance as given by Shannon's rate-distortion function
is the same for the source and its innovations [21], suggesting a good
data compressor can be built that operates on the memoryless innovations.
The innovations quantizer is considered here only to try to dispel this
myth by demonstrating that in general a data compressor operating on the
innovations in a nearly optimal fashion can in fact perform quite
poorly overall (for example, in the Gauss-Markov case the optimal
innovation quantizer will perform exactly the same as an optimum
quantization of the process itself). Neither its use as an approximation
to a predictive quantizer nor an interesting but not well-understood result
of rate-distortion theory can be used as motivation for using an
innovations quantizer.

A tree or trellis data compressor consists of a sliding-block
decoder $g$ mapping binary sequences into reproduction symbols and
a matched tree search algorithm [6,7,8,9,10]. The encoder, which knows
the current state of the decoder, uses a distortion measure $\rho$ (here
$\rho(a,b) = (a-b)^2$), a search depth $L$ and the current state of the
decoder register to find the available sequence of path labels, say
$\hat{x}_0, \ldots, \hat{x}_{L-1}$, emanating from the initial state (node)
of the tree that minimizes the sample distortion
$L^{-1} \sum_{i=0}^{L-1} \rho(x_i, \hat{x}_i)$ between the reproduction
sequence and the source sequence $x_0, \ldots, x_{L-1}$. If it is a block tree
code [6,7,8,9,10], the encoder then outputs the binary "path map"
through the tree forcing the decoder to output
$\hat{x}_0, \ldots, \hat{x}_{L-1}$. If it is an
incremental tree code [10,12,18], the encoder outputs only enough binary
symbols (one for a rate 1 code) to force the decoder to output $\hat{x}_0$.
After the binary channel symbols ($L$ or 1) are output, a new search
is performed and operation continues. The actual tree search may be
performed via the Fano, Stack, or (M,L) [15, 18, 29, 30, 1 (Ch. 6)]
algorithms or, if the decoder has finite constraint length and hence the
tree reduces to a trellis, by a Viterbi algorithm [9,10].
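
The difference between the block and the incremental modes of operation can be sketched with an exhaustive depth-$L$ tree search in Python. The sketch below is an illustration under assumptions, not one of the search algorithms cited above: the decoder is represented by a hypothetical function `step(state, bit)` returning the next state and the branch label, and `example_decoder` is an arbitrary feedback decoder chosen only to exercise the code.

```python
from itertools import product


def best_path(step, state, block):
    """Exhaustively test all 2^L binary paths of depth L = len(block)."""
    best = None
    for bits in product((0, 1), repeat=len(block)):
        s, dist = state, 0.0
        for bit, x in zip(bits, block):
            s, x_hat = step(s, bit)
            dist += (x - x_hat) ** 2             # squared-error distortion
        if best is None or dist < best[0]:
            best = (dist, bits)
    return best[1]


def block_encode(step, state, source, L):
    """Block mode: release the whole path map for each non-overlapping block."""
    out = []
    for i in range(0, len(source), L):
        bits = best_path(step, state, source[i:i + L])
        for bit in bits:
            state, _ = step(state, bit)          # continue from the last state
        out.extend(bits)
    return out


def incremental_encode(step, state, source, L):
    """Incremental mode: release only the first bit of each best path."""
    out = []
    for i in range(len(source)):
        bit = best_path(step, state, source[i:i + L])[0]
        state, _ = step(state, bit)
        out.append(bit)
    return out


def distortion(step, state, bits, source):
    """Average squared error of the reproduction driven by the given bits."""
    total = 0.0
    for bit, x in zip(bits, source):
        state, x_hat = step(state, bit)
        total += (x - x_hat) ** 2
    return total / len(source)


def example_decoder(a, delta):
    # Hypothetical decoder with feedback; the state is the previous output.
    def step(x_hat_prev, bit):
        x_hat = (delta if bit else -delta) + a * x_hat_prev
        return x_hat, x_hat
    return step
```

With `L = 1` the incremental search reduces to a greedy symbol-by-symbol decision, while a block search over the entire sequence returns the globally best path; intermediate depths trade computation (about $2^L$ paths per search) against distortion.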

A natural question when dealing with data compression systems is
what the "optimal" performance is in the sense of minimizing average
distortion (or "maximizing" fidelity) for a given fixed rate. Given a
probabilistic model $\mu$ for a source, a distortion measure $\rho$, and a
fixed rate $R$, an unbeatable lower bound to the average distortion
between the original source and its reconstruction is given by Shannon's
distortion-rate function $D(R)$ [1,2]. Roughly speaking, $D(R)$ is the
smallest average distortion over all "stochastic codes" or random
connections between a source and an approximation of the source such
that the average mutual information rate between source and approximation
is no greater than $R$. In addition, coding theorems exist showing that
performance arbitrarily close to $D(R)$ can be achieved using certain
classes of coding structures, in particular using a sliding-block
encoder and decoder [11,12,22] and a sliding-block decoder with a
matched block tree search encoder [10]. No coding theorems exist for the
incremental tree encoder, but simulations have often shown these
encoders to be superior to the block tree encoder. For our purposes,
however, $D(R)$ provides an absolute yardstick to which all actual
systems can be compared.
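
For the Gaussian sources and squared-error distortion considered here the yardstick can be put in numbers. The Python sketch below is an addition for illustration and is not taken from the paper: it uses the standard closed forms $D(R) = \sigma_X^2 2^{-2R}$ for a memoryless Gaussian source and $D(R) = (1-a^2)\sigma_X^2 2^{-2R}$ for a first-order Gauss Markov source, the latter valid for $R \ge \log_2(1+|a|)$, and expresses the bound as a signal-to-noise ratio in dB. The function names are hypothetical.

```python
import math


def snr_bound_memoryless_db(rate):
    """10 log10(sigma_X^2 / D(R)) for a memoryless Gaussian source."""
    return 10.0 * math.log10(2.0 ** (2.0 * rate))


def snr_bound_gauss_markov_db(rate, a):
    """Same bound for a first-order Gauss Markov source with constant a, |a| < 1."""
    if not abs(a) < 1.0:
        raise ValueError("stationary source requires |a| < 1")
    if rate < math.log2(1.0 + abs(a)):
        raise ValueError("closed form assumed only for rate >= log2(1 + |a|)")
    return 10.0 * math.log10(2.0 ** (2.0 * rate) / (1.0 - a * a))


bound_memoryless = snr_bound_memoryless_db(1.0)       # about 6.02 dB at 1 bit/symbol
bound_markov = snr_bound_gauss_markov_db(1.0, 0.5)    # about 7.27 dB at 1 bit/symbol
```

At a rate of 1 bit/symbol the validity condition holds for every $|a| < 1$, so the second function covers the whole range of stationary autoregressive constants at that rate.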


3. Plagiarized Decoders


As previously described, a natural choice for a decoder in a tree or
trellis encoding data compression system is simply the decoder of a
good existing system such as a linear predictive quantizer or delta
modulation. As the decoder for such systems has an infinite constraint
length, a tree search algorithm is required. Alternatively, one may
truncate the decoder to a finite length $K$ by replacing (2.4) by

$$\hat{X}_n = b \sum_{i=0}^{K-1} a^i \hat{e}_{n-i}, \tag{3.1}$$

where $\hat{e}_n$ is the quantized error term driving the decoder filter,
and then use the Viterbi algorithm to search the resulting trellis.
The scaling constant $b$ allows the designer some extra freedom and
was selected so as to experimentally minimize $E(X_n - \hat{X}_n)^2$ for a
given constraint length $K$. In this section we concentrate on the predictive
quantizer decoder and postpone the delta modulator decoder to the next
section since it can be arrived at using the fake process approach and
we prefer to call this decoder the Central Limit Theorem decoder (CLT)
for reasons which will be clarified later.
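
To make the truncation concrete, the following Python sketch tabulates the branch labels produced by a decoder of the form (3.1) for every possible content of a $K$-stage binary register. It is a sketch under assumptions: the one-bit case is assumed, with quantized error values `+delta` for a 1 and `-delta` for a 0, and the function name and parameter values are hypothetical rather than those used in the simulations.

```python
from itertools import product


def truncated_decoder_labels(a, b, K, delta):
    """Labels b * sum_{i=0}^{K-1} a^i e_{n-i} for each register content (newest bit first)."""
    labels = {}
    for contents in product((0, 1), repeat=K):
        labels[contents] = b * sum(
            (a ** i) * (delta if bit else -delta)    # assumed map: bit -> +/-delta
            for i, bit in enumerate(contents)
        )
    return labels


labels = truncated_decoder_labels(a=0.5, b=1.0, K=3, delta=1.0)
largest = max(labels.values())    # all-ones register: 1 + 0.5 + 0.25
```

The table has $2^K$ entries, one per trellis branch. The terms $a^i$ with $i \ge K$ are discarded by the truncation, so the loss is small when $|a|^K$ is small and grows as the autoregressive constant approaches one.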

The truncated predictive quantizer decoder is described in Fig. 9.
Simulations were performed for block Viterbi algorithm encoders with
decoders of lengths 3, 4 and 5 and first order Gauss Markov sources with
various autoregressive constants. The block length was 2000 symbols.
The experimentally determined value of $b$ is 0.7-0.8 and the SNR (in dB),
defined by

$$\mathrm{SNR} = 10 \log_{10} \frac{E X_n^2}{E(X_n - \hat{X}_n)^2},$$

is plotted in Fig. 10. Also plotted for comparison is the performance
of linear predictive quantization with predictor

$$P_n = r \hat{X}_{n-1}$$

and an optimum quantizer as determined by Arnstein [23]. The distortion
rate function is also plotted for reference. Note that this trellis code
system outperforms the predictive quantizer by about 1 dB for
autoregressive constants that are not too close to one. The restriction
arises because truncating the decoder causes the system to be unable to
take advantage of the large amount of redundancy when the autoregressive
constant is large.
If the decoder is not truncated and a tree search algorithm is
used instead of a trellis search, then the performance for large
autoregressive constants is improved. An incremental exhaustive tree search
was used in the simulation; this algorithm has the advantage of
operating in a sliding block fashion, thus requiring a constant computation
time for each symbol processed and eliminating the need for buffers
in practical systems. Some tree search algorithms, like
the Stack and Fano algorithms, have search times which depend on the
particular source sequence. It was also found that the Stack algorithm
required much more computation time without any performance improvement
in our case.

Because of the constant computation time and constant delay, the
incremental tree search is especially attractive for practical real time
systems, since it was found experimentally that for sources with
autoregressive constants not too close to one a relatively low search
depth of the order of 5 is sufficient. Thus for each input symbol the
encoder must check $2^5 = 32$ alternatives, which is a reasonable number.


Simulation  results  for  the  predictive  quantizer  decoder  and  the 
incremental  tree  search  encoder  are  given  in  Fig.  11  for  two  search 
depths. The experimentally determined value of $b$ is 0.8-0.9 over
the whole range.

It  should  be  noted  that  the  predictive  quantizer  decoder-tree 
search  encoder  system  (which  can  be  denoted  by  LAPQ,  Look  Ahead 
Predictive  Quantization)  does  not  suffer  a deterioration  in  performance 
for  high  autoregressive  constants  as  there  is  no  decoder  truncation 
in  this  case. 


4. Fake Processes


The fake process approach is a design philosophy that allows one
to replace the problem of designing a tree or trellis data compression
system by the problem of designing a filter that, when driven by an i.i.d.
discrete source with, say, $M$ equally probable letters, will produce
a "good" fake of the process one wishes to compress.

The fake process technique is a means of mechanizing the theoretical
approach developed in [10]. Originally this design problem was
called the simulation problem, but we prefer the name "fake process" to
avoid confusion, as most results are obtained by simulations.

The fake process technique is intimately connected with the notion
of a distance between stochastic processes. Several definitions of
such a distance exist but the one applicable to our case is the
generalized Ornstein distance [24].

Suppose that we are given two stationary-ergodic processes $\{X\}$
and $\{Y\}$ and a single letter distortion measure $\rho(\cdot,\cdot)$. The
processes are completely specified by the $n$th order marginal
distributions $\mu^n$ and $\nu^n$ for $n = 1,2,3,\ldots$ for $\{X\}$ and
$\{Y\}$ respectively.

Now define the $n$th order generalized Ornstein distance by

$$\bar{\rho}_n(X^n, Y^n) = \inf E\Big\{ n^{-1} \sum_{i=0}^{n-1} \rho(X_i, Y_i) \Big\} \tag{4.1}$$

where the infimum is taken over all stationary probability distributions
having $\mu^n$ and $\nu^n$ as marginals.

Define the generalized Ornstein distance $\bar{\rho}$ by

$$\bar{\rho}(X, Y) = \sup_n \bar{\rho}_n. \tag{4.2}$$

Roughly speaking, $\bar{\rho}$ is the minimal distortion between the
processes $\{X\}$ and $\{Y\}$ obtainable by embedding them both in a super
process that yields $\{X\}$ and $\{Y\}$ as coordinate processes.

Since  a coding  process  is  a means  of  statistically  joining  two 
processes  (the  source  and  the  reproduction)  this  notion  of  distance 
has  several  applications  to  information  theory  [24,  25]. 
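As a small illustration of the distance just defined, the sketch below estimates only the first-order term ($n = 1$) of (4.1) from samples, under the assumption of a squared-error distortion measure. For scalar samples and a convex cost the best coupling of two equal-size empirical distributions pairs the sorted samples, so the infimum reduces to a sort. Since (4.2) takes a supremum over $n$, this number is only a lower bound on $\bar\rho$; the function name and the two-level "fake" in the demo are hypothetical.

```python
import random


def rho_bar_1_squared(xs, ys):
    """Empirical n = 1 term of (4.1) for the distortion (x - y)^2.

    Pairs the sorted samples, which is the optimal coupling of two
    equal-size scalar empirical distributions under a convex cost.
    """
    if len(xs) != len(ys):
        raise ValueError("samples must have equal length")
    pairs = zip(sorted(xs), sorted(ys))
    return sum((x - y) ** 2 for x, y in pairs) / len(xs)


if __name__ == "__main__":
    rng = random.Random(0)
    real = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
    # Hypothetical two-level fake with equiprobable letters.
    fake = [rng.choice((-0.8, 0.8)) for _ in range(10_000)]
    print("first-order distance estimate:", rho_bar_1_squared(real, fake))
```

Matching marginals is necessary but not sufficient for a small $\bar\rho$: the higher-order terms also depend on the memory of the two processes.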

Armed with this notion of distance we can now define a fake process. Suppose that we are given a stationary-ergodic process $\{X\}$ and a single letter distortion measure $\rho(\cdot,\cdot)$. Let $\{U\}$ be an i.i.d. process uniformly distributed on $M$ discrete levels (thus the entropy rate of $\{U\}$ is $\log_2 M$). Given $\{U\}$ and $\epsilon > 0$, find a filter (such as in Fig. 12) such that when it is driven by the process $\{U\}$ the output process $\{\hat X\}$ satisfies

$$\bar\rho(X, \hat X) = \epsilon_0 \le \epsilon \qquad (4.3)$$

$\{\hat X\}$ is called a fake process for $\{X\}$ and $\epsilon_0$ is the distance between the real and fake processes.

If $D(R)$ is the distortion rate function of the process $\{X\}$ relative to the fidelity criterion $\rho(\cdot,\cdot)$, then it can be shown [10] that for any fake process $\{\hat X\}$

$$\bar\rho(X, \hat X) \ge D(\log_2 M) \qquad (4.4)$$

for any choice of filter. In addition, for a large class of processes (including all Markov processes), given any $\epsilon > 0$ a fake process (or a filter) can be found such that

$$\bar\rho(X, \hat X) < D(\log_2 M) + \epsilon \qquad (4.5)$$

The appearance of the distortion-rate function in (4.4) and (4.5) immediately suggests that the fake process technique should be applicable to source coding (or data compression), which is indeed the case. Suppose we have a good fake process (i.e., $\bar\rho$ is close to $D(\log_2 M)$). Use the filter that produces the fake process as the decoder in a rate $\log_2 M$ data compression system and use a tree search or Viterbi algorithm as the encoder. Since the search algorithm essentially finds the best path through the tree or trellis, the performance of this system for a sufficiently large block length should be close to the distance between $\{X\}$ and $\{\hat X\}$, which in turn is close to $D(\log_2 M)$. Thus we have designed a good data compression system using a good fake process. The above discussion can be made precise, as is done in [10], to show that good fake processes indeed yield good data compression systems.

We  have  shown  that  in  order  to  design  good  data  compression 
systems  one  can  reduce  the  problem  to  that  of  designing  good  fake  processes. 
In  many  cases  a good  fake  process  is  easy  to  arrive  at  because  of  a 
special  source  structure  or  property  and,  as  will  be  seen,  there  are  some 
natural  ways  to  obtain  good  fake  processes.  Unfortunately,  theoretical 
analysis  of  the  performance  of  the  suggested  systems  is  at  this  point 
impossible  because  of  the  analytical  difficulty  in  computing  the 
p-distance  except  for  simple  cases. 

It is also worth noting that we have obtained theoretical bounds on the fidelity of simulating a continuous physical process by filtering a finite entropy i.i.d. process. Since, for example, a computer is a discrete machine, all computer-produced processes used to simulate continuous processes are fake processes and the bounds (4.4), (4.5) apply.

The C.L.T. Decoder

The Central Limit Theorem (C.L.T.) states that the distribution of the sum of a large number of properly scaled finite variance i.i.d. random variables is well-approximated by a Gaussian distribution. To be precise, if $\{Y_i\}$ are i.i.d. with zero mean and variance $\sigma^2$ then

$$\Pr\left( \frac{1}{\sigma\sqrt{n}} \sum_{i=1}^{n} Y_i \le \alpha \right) \to \int_{-\infty}^{\alpha} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt \qquad (4.6)$$

as $n \to \infty$.

The summing and scaling can be thought of as a sliding-block coding operation and this result suggests a means of generating a fake Gaussian process with an entropy rate constraint. The resulting filter (or decoder) is shown in Fig. 13. Since we have

$$\hat X_n = \sum_{i=0}^{K} U_{n-i+1} \qquad (4.7)$$

the filter is really a $K$-th order linear moving average filter. Also
if  takes  on  the  values  +1,  +2,  +3,...,+  (^-1)  then  the  decoder 

is  really  the  truncated  DPCM  decoder.  For  the  case  M = 1,  this  is  the 
truncated  delta  modulation  decoder  and  it  is  interesting  that  one  can 
arrive  at  this  decoder  using  the  fake  process  approach. 

Note that it is not claimed that $\{\hat X_n\}$ is a Gaussian process, but only that it has approximately Gaussian marginals. In addition, the $\hat X_n$ will not be independent; as a matter of fact the following interesting properties of the C.L.T. decoder are proved in Appendix A:

(1) $H(\hat X) = \log_2 M = H(U)$, where $M$ is the number of discrete levels the process $\{U_n\}$ takes on. This is independent of the shift register length.

(2) The autocorrelation function of $\{\hat X_n\}$ is given by

$$R_{\hat X}(\tau) = E \hat X_0 \hat X_\tau = \begin{cases} \sigma^2 (K - |\tau|)/K & |\tau| \le K \\ 0 & \text{elsewhere} \end{cases}$$

where $K$ is the shift register length.

It can also be shown [26] that $\{\hat X_n\}$ is not Markov of any order. Because of the strong memory dependence it seems that the C.L.T. fake process should be used to code sources with memory.

Suppose that the source is a first order Gauss-Markov process with autoregressive constant $a$. If $U$ is a Bernoulli process (we deal with rate 1 data compression systems) then it can be shown [26] that the scaling constant $b$ in (3.1) should be

$$b = \frac{1}{\sqrt{K(1 - a^2)}} \qquad (4.8)$$

if we wish the fake process to have the variance of the original process ($K$ is the shift register length). This value of $b$ is in very good agreement with the experimentally found value and is given in Fig. 14 for a shift register of length 5 and various autoregressive constants.
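The variance matching behind the scaling constant, and the triangular autocorrelation of property (2), can be checked numerically. The sketch below is illustrative only: it assumes a length-$K$ register of $\pm 1$ inputs whose scaled sum is the output (the form used in Appendix A) and a Gauss-Markov source with unit-variance innovations, so that the source variance is $1/(1-a^2)$. The function names are hypothetical. The expectation is computed exactly by enumerating all equiprobable input tuples.

```python
import math
from itertools import product


def clt_scale(K, a):
    """Scaling constant b = 1 / sqrt(K (1 - a^2)) for register length K."""
    return 1.0 / math.sqrt(K * (1.0 - a * a))


def clt_output(window, b):
    """Decoder output: scaled sum of the K register cells (+1 or -1 each)."""
    return b * sum(window)


def exact_autocorr(K, b, lag):
    """E[X_n X_{n+lag}], averaged over all equiprobable +/-1 input tuples."""
    total = 0.0
    for u in product((-1, 1), repeat=K + lag):  # newest input first
        later = clt_output(u[:K], b)            # register at time n + lag
        earlier = clt_output(u[lag:lag + K], b)  # register at time n
        total += later * earlier
    return total / 2 ** (K + lag)


if __name__ == "__main__":
    K, a = 5, 0.8
    b = clt_scale(K, a)
    print("b =", b, " source variance =", 1.0 / (1.0 - a * a))
    for lag in range(K + 2):
        print(lag, exact_autocorr(K, b, lag))
```

The printed values fall linearly from the source variance at lag 0 to zero at lag $K$, which is why a short register cannot follow the slowly decaying correlation of a source with $a$ close to one.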

The performance of the data compression system consisting of the C.L.T. decoder and a Viterbi algorithm encoder was evaluated through simulations using several blocks of length 2000 of a first order Gauss-Markov source. The results are given in Fig. 14 for various decoder lengths and autoregressive constants. It can be seen that the C.L.T. system outperforms delta modulation by about 2 dB for mid-range autoregressive constants. For very high autoregressive constants the performance of the C.L.T. system deteriorates because the shift register is too short.
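The encoder side of such a system can be sketched as a standard Viterbi search over the trellis whose states are the $K-1$ most recent bits. This is a minimal sketch, not the simulation code behind Fig. 14: it assumes binary inputs mapped to $\pm 1$, a register that starts at all zeros, $K \ge 2$, squared-error distortion, and a source with unit-variance innovations. The names `viterbi_encode` and `clt_decode_bits` are hypothetical.

```python
import math
import random


def clt_decode_bits(bits, K, b):
    """C.L.T. decoder: b times the sum of the last K inputs, with 0 -> -1, 1 -> +1."""
    reg, mask, out = 0, (1 << K) - 1, []
    for u in bits:
        reg = ((reg << 1) | u) & mask
        out.append(b * (2 * bin(reg).count("1") - K))
    return out


def viterbi_encode(source, K, b):
    """Return (bits, mean squared error) of the best trellis path (K >= 2)."""
    n_states = 1 << (K - 1)          # state = K-1 most recent bits
    mask = n_states - 1
    level = [b * (2 * bin(w).count("1") - K) for w in range(1 << K)]
    inf = float("inf")
    cost = [inf] * n_states
    cost[0] = 0.0                    # register assumed to start at zeros
    back = []                        # back[t][state] = predecessor state
    for x in source:
        new_cost = [inf] * n_states
        prev = [0] * n_states
        for s in range(n_states):
            if cost[s] == inf:
                continue
            for u in (0, 1):
                w = (s << 1) | u     # full K-bit register after the shift
                nxt = w & mask
                c = cost[s] + (x - level[w]) ** 2
                if c < new_cost[nxt]:
                    new_cost[nxt], prev[nxt] = c, s
        back.append(prev)
        cost = new_cost
    s = min(range(n_states), key=cost.__getitem__)
    total = cost[s]
    bits = []
    for prev in reversed(back):      # trace the surviving path backwards
        bits.append(s & 1)           # newest bit is the state's low bit
        s = prev[s]
    bits.reverse()
    return bits, total / len(source)


if __name__ == "__main__":
    rng = random.Random(0)
    a, K, x, src = 0.8, 5, 0.0, []
    for _ in range(2000):            # one block of a Gauss-Markov source
        x = a * x + rng.gauss(0.0, 1.0)
        src.append(x)
    b = 1.0 / math.sqrt(K * (1.0 - a * a))
    bits, mse = viterbi_encode(src, K, b)
    print("mean squared error:", mse)
```

Work per input symbol is proportional to $2^K$, so the search cost doubles with each added register cell.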
It is possible to have an infinite length C.L.T. decoder, but in this case the trellis structure is lost and a tree search is used as the encoder. We chose to use an incremental tree search for reasons outlined in the introduction (this results in a true sliding block system: sliding block encoder and sliding block decoder). We chose to call this system LADM, for Look Ahead Delta Modulation, since it is basically a delta modulation system that looks into the "future" to determine the current output. The performance of the system, based upon a simulation of 10,000 input samples, is given in Fig. 15. In this example a search depth of 5 was used, as we found that deeper searches contributed negligible improvement. A constant improvement of about 1.5 dB is realized (relative to delta modulation). As a matter of fact, even for a Wiener sequence (unity autoregressive constant) LADM will yield distortion (mean square error) that is 1.5 dB below the delta modulation distortion (the same is true for the SNR defined as $10 \log_{10} \left[ \frac{1}{n}\sum_{i=0}^{n-1} X_i^2 \Big/ \frac{1}{n}\sum_{i=0}^{n-1} (X_i - \hat X_i)^2 \right]$). Note also that the complexity increase for the LADM system is moderate, since for a depth 5 search the encoder has to search only 32 possibilities before deciding upon the output.
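A look-ahead search of this kind can be sketched in a few lines. The code below is an illustration under stated assumptions rather than the authors' simulation: the decoder is taken to be a plain accumulator, $\hat x_n = \hat x_{n-1} + b\,u_n$ with $u_n = \pm 1$ (the scaling of (3.1) may differ), the distortion is squared error, and ties are broken toward $-1$. For each input the encoder tries all $2^{\text{depth}}$ continuations, keeps the best one, and releases only its first symbol.

```python
import itertools


def ladm_encode(source, b, depth):
    """Incremental tree search; returns (symbols, reproduction)."""
    symbols, recon = [], []
    x_hat = 0.0
    for t in range(len(source)):
        window = source[t:t + depth]        # shorter near the end of the block
        best_cost, best_first = float("inf"), -1
        for path in itertools.product((-1, 1), repeat=len(window)):
            y, cost = x_hat, 0.0
            for u, x in zip(path, window):  # run the decoder along this path
                y += b * u
                cost += (x - y) ** 2
            if cost < best_cost:
                best_cost, best_first = cost, path[0]
        x_hat += b * best_first             # commit only the first symbol
        symbols.append(best_first)
        recon.append(x_hat)
    return symbols, recon


if __name__ == "__main__":
    src = [0.3, 1.1, 0.2, -0.7, -1.5, 0.4]
    print(ladm_encode(src, 0.5, 1))   # depth 1 is ordinary delta modulation
    print(ladm_encode(src, 0.5, 5))   # depth 5 checks 32 paths per input
```

With `depth=1` the rule reduces to sending the sign of the current prediction error, which is ordinary delta modulation; larger depths let the encoder accept a worse step now in exchange for a better path over the window.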
There are several ways to modify the C.L.T. decoder so that the autocorrelation of the output sequence is decreased, thus making these decoders candidates for compression of i.i.d. Gaussian processes.

Three examples are shown in Fig. 16.* For each decoder the autocorrelation function is also shown. Assume that the input is an i.i.d. Bernoulli process with levels +1 and -1. Note that the decoder in Fig. 16(a) has the property that any two different shifts have at most one common bit; thus the autocorrelation is reduced significantly. The decoder in Fig. 16(b) is a "brute force" way to obtain zero autocorrelation by using the first bit as a sign bit.

Decoders of the form of Fig. 16 were tested in a data compression system with a Viterbi algorithm as the encoder and an i.i.d. Gaussian source. The performance was quite poor. The decoders of Fig. 16(a) and 16(c) were able approximately to match the performance of the optimum quantizer for shift register lengths up to 7. The decoder in Fig. 16(b) performs about 0.2 dB worse than the optimum quantizer. Note that as far as distribution and second order characteristics (spectrum) are concerned, the decoder in Fig. 16(b) is a better match for the i.i.d. Gaussian process. This shows that even though distribution and second-order characteristics can be used as a guideline to design fake processes, a good match of those characteristics does not necessarily yield a good fake process.


*The first two were suggested by L.D. Davisson, the last one by J. Dunham.

Inverse-Distribution Decoders

Given a random variable $X$ with a continuous cumulative distribution $F_X(x) = \Pr(X \le x)$, the random variable $X$ can always be modelled as $X = F_X^{-1}(V)$, where $V$ is a uniformly distributed random variable on [0,1] and $F_X^{-1}(\alpha) = x$ if $x$ is the smallest value for which $F_X(x) = \alpha$. In other words, the random variable $F_X^{-1}(V)$ has a distribution identical to that of $X$ since (when $F_X$ is continuous):

$$F_{F_X^{-1}(V)}(v) = \Pr\{F_X^{-1}(V) \le v\} = \Pr\{V \le F_X(v)\} = F_X(v) \qquad (4.9)$$

When $F_X$ is not continuous (4.9) still holds [26].

This suggests another technique for faking a Gaussian process: consider the filter register as a binary expansion of a number in [0,1] and take $F_X^{-1}$ of the contents. We focus on the rate 1 case for simplicity. Given an i.i.d. binary {0,1} random process $\{U_n\}$ with $\Pr(U_n = 0) = \Pr(U_n = 1) = 1/2$, the contents of a length $K$ shift register, say $(U_n, U_{n-1}, \ldots, U_{n-K+1})$, can be considered as a finite approximation $\tilde v_n^{(K)}$ to a uniform variable; that is, the binary $K$-tuple $(U_n, U_{n-1}, \ldots, U_{n-K+1})$ is considered as a binary representation of a number between 0 and 1 via

$$\tilde v_n^{(K)} = \sum_{i=0}^{K-1} U_{n-i}\, 2^{-i-1} + 2^{-(K+1)} \qquad (4.10)$$

where the bias $2^{-(K+1)}$ is used for technical reasons to avoid getting $\tilde v_n^{(K)} = 0$. For a fixed $K$, $\tilde v_n^{(K)}$ is a discrete random variable uniformly distributed on the numbers $\{ i\,2^{-K} + 2^{-(K+1)},\ i = 0,1,2,\ldots,2^K - 1 \}$. In the limit as $K \to \infty$ the distribution of $F_X^{-1}(\tilde v_n^{(K)})$ converges to that of $F_X^{-1}(V)$, which is $F_X$. Thus in the limit, for large $K$, the sliding-block coder of Fig. 17 driven by a binary i.i.d. equiprobable sequence $\{U_n\}$ should produce a process

$$\hat X_n = F_X^{-1}(\tilde v_n^{(K)}) = F_X^{-1}\left( \sum_{i=0}^{K-1} U_{n-i}\, 2^{-i-1} + 2^{-(K+1)} \right) \qquad (4.11)$$

with a distribution which is approximately $F_X$ and which has the following properties [26], which are proved in Appendix B:

(1) $\{\hat X_n\}$ is a first-order Markov process.

(2) The entropy of $\hat X$ is given by

$$H(\hat X) = H(U) = 1$$

for a driving binary equiprobable source.

(3) The process $\{\hat X_n\}$ has the distribution of an equal area (maximum entropy) quantization of the distribution $F_X$ where the output levels are chosen so as to minimize the average absolute error.

(4) The autocorrelation function $R_{\hat X}(\tau)$ satisfies $R_{\hat X}(\tau) = 0$ for $\tau \ge K$ for decoder length $K$. For $K = \infty$, $R_{\hat X}(\tau) \to [E(\hat X)]^2$ as $\tau \to \infty$; however, the rate of convergence depends upon the distribution itself.
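The decoder described by (4.10) and (4.11) is short enough to write out. The sketch below assumes a standard Gaussian target (via `statistics.NormalDist`), a register that starts at all zeros, and hypothetical function names; any other inverse distribution function can be passed as `inv_cdf`.

```python
from itertools import product
from statistics import NormalDist


def register_value(bits_newest_first):
    """(4.10): map (U_n, U_{n-1}, ..., U_{n-K+1}) to a point of (0, 1)."""
    K = len(bits_newest_first)
    v = sum(u * 2.0 ** (-i - 1) for i, u in enumerate(bits_newest_first))
    return v + 2.0 ** (-(K + 1))          # bias keeps the value away from 0


def inverse_distribution_decode(bits, K, inv_cdf=NormalDist().inv_cdf):
    """(4.11): shift each new bit in and apply the inverse distribution."""
    register = [0] * K                    # assumed initial contents
    out = []
    for u in bits:
        register = [u] + register[:-1]    # newest bit becomes most significant
        out.append(inv_cdf(register_value(register)))
    return out


if __name__ == "__main__":
    K = 3
    for cells in product((0, 1), repeat=K):
        v = register_value(list(cells))
        print(cells, v, NormalDist().inv_cdf(v))
```

For $K = 3$ the eight register contents map to the eight midpoints $1/16, 3/16, \ldots, 15/16$, so the output alphabet consists of the Gaussian quantiles at those points, in line with property (3).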

Obviously, successive samples of the fake process cannot be independent since each successive binary $K$-tuple is a shift of the previous one, with one new symbol going in and the last symbol being shifted out. Here, however, a simple technique can be used to at least cause the $\hat X_n$'s to be uncorrelated (or "linearly independent") and hence $\{\hat X_n\}$ will be a pseudo-white process in the sense of having a flat power spectrum and approximately the right marginal distribution. This rather odd process will be a candidate for coloring the trellis for memoryless sources and will serve as a building block to color the tree for sources with memory.

The basic idea for modifying the system of Fig. 17 is to introduce a "scrambler" in order to mix up the influence of the binary symbols in the shift register, hopefully so as to decorrelate the resulting process. Two forms of scrambling are possible: using a block scrambler mapping $\{0,1\}^K \to \{0,1\}^K$ on the binary $K$-tuple as in Fig. 18, or forming a scrambling function $g: [0,1] \to [0,1]$ which scrambles the discrete uniform variable $\tilde v_n^{(K)}$ prior to passage through the inverse distribution function as in Fig. 19.

The class of all block scramblers of Fig. 18 is contained in the class of all scrambling function scramblers as far as the input to the inverse distribution operator is concerned [26]; hence we focus on the second class. The existence of decorrelating scramblers is guaranteed by the following theorem, which is proved in Appendix C.

Theorem 4.1:

If the $X_n$'s are random variables with a continuous distribution function $F_X(x)$ that is anti-symmetric about $F_X = 1/2$ (i.e., the probability density function $f_X(x) = dF_X(x)/dx$ is symmetric about the mean $EX$, zero in our case), and if a simulation coder (fake process) is constructed as in Fig. 19 with $g$ any function satisfying

$$g(x) + g(x + 1/2) = 1, \qquad 0 \le x < 1/2 \qquad (4.12)$$

then the resulting process is uncorrelated, that is:

$$E \hat X_n \hat X_{n+k} = 0 \quad \text{for all } k \ne 0 \qquad (4.13)$$

We note that the assumptions on $F_X(x)$ are satisfied for the Gaussian processes considered and also for other sources of interest. Also, additional constraints must be imposed upon the scrambling function $g$ (in addition to (4.12)) if we want the output of the scrambler to have a uniform discrete distribution and also to have the maximum entropy,
i.e. , 1 bit/symbol.  Let  [a]  denote  the  integer  part  of  a,  let  K 
be  the  decoder  length  and  let  A be  the  set 

Let $g: A \to R$; then we have the following lemmas, which are proved in Appendix D.

Lemma 4.1:

The fake process of Fig. 19 will have an entropy of 1 bit/symbol if the scrambling function satisfies

$$g(x + 1/2) \ne g(x) \qquad (4.14)$$

for $x \in A$ and $x < 1/2$.

Also, if we let $\|S\|$ denote the cardinality (number of elements) of the set $S$ and define $g^{-1}\{X\} = \{y \in A : g(y) = X\}$, we obtain the following lemma.

Lemma 4.2:

The output of the scrambler will have a uniform discrete distribution if and only if

$$\|g^{-1}\{X\}\| = \|g^{-1}\{Y\}\| \quad \text{for all } X, Y \in g(A) \qquad (4.15)$$

where $g(A)$ is the range of the function $g$.


Some examples of scrambling functions are given in Fig. 20. The reader can easily verify that all the functions in Fig. 20 satisfy the conditions in Lemmas 4.1 and 4.2. We concentrate on scrambling functions of types b and c. Note that those functions are periodic on [0,1] and that they will satisfy the conditions (4.14) and (4.15) for any number of periods. It was found experimentally that for a shift register of length $K$ the best results were obtained when the number of periods of the scrambling function was $2^{K-1} + 1$ or $2^{K-1} - 1$. In this case the output of the scrambling function of type b contains only $2^{K-1}$ letters (because of symmetry) while the output of a type c scrambling function contains $2^K$ letters.
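Conditions (4.12), (4.14) and the decorrelation claim (4.13) can be checked numerically for a concrete scrambler. Fig. 20 is not reproduced here, so the sketch below uses a hypothetical stand-in: a triangle wave on [0,1] with an odd number of periods, which satisfies $g(x) + g(x + 1/2) = 1$ because shifting a triangle wave by half a period complements it. The target distribution is assumed to be the standard Gaussian, and the expectation is computed exactly by enumerating all equiprobable input tuples.

```python
from itertools import product
from statistics import NormalDist

INV = NormalDist().inv_cdf  # symmetric target distribution (assumption)


def triangle_scrambler(x, periods):
    """Hypothetical g: triangle wave on [0, 1] with `periods` periods."""
    y = (x * periods) % 1.0
    return 2.0 * y if y < 0.5 else 2.0 - 2.0 * y


def register_value(bits_newest_first):
    """Biased binary expansion of the register contents, as in (4.10)."""
    K = len(bits_newest_first)
    v = sum(u * 2.0 ** (-i - 1) for i, u in enumerate(bits_newest_first))
    return v + 2.0 ** (-(K + 1))


def exact_autocorr(K, periods, lag):
    """E[X_n X_{n+lag}] for the scrambled decoder, by full enumeration."""
    total = 0.0
    for bits in product((0, 1), repeat=K + lag):     # newest bit first
        later = INV(triangle_scrambler(register_value(bits[:K]), periods))
        earlier = INV(triangle_scrambler(
            register_value(bits[lag:lag + K]), periods))
        total += later * earlier
    return total / 2 ** (K + lag)


if __name__ == "__main__":
    K = 4
    periods = 2 ** (K - 1) + 1
    for lag in range(K + 1):
        print(lag, exact_autocorr(K, periods, lag))
```

On the grid of register values this particular $g$ is two-to-one, so its output has $2^{K-1}$ letters, and every nonzero lag gives zero correlation up to rounding.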

Since the fake process obtained by the filter in Fig. 19 is pseudo-white, we tried to use this decoder in a trellis block code system for the hard case of compressing white Gaussian noise.

Using simulations it was discovered that the scrambling functions of types b and c yield systems with almost identical performance (the small deviations can be attributed to errors resulting from simulation). Thus we concentrate on the somewhat simpler class of functions of type b for the rest of this report.

i.i.d. Processes


The filter in Fig. 19 using the scrambling function of Fig. 20(b) was used as a decoder in a data compression system where a Viterbi algorithm was used as the encoder. As usual a scaling factor was added to the output of the filter to allow for an additional degree of freedom. This scaling factor was optimized experimentally and it was found that for an i.i.d. Gaussian source, this factor was in the vicinity of 0.9 for all register lengths which were tried.

Simulations were carried out for several blocks of length 2000. The results are given in Fig. 21 where a comparison is made to the optimal Max quantizer [27] and the rate distortion bound. For short shift registers the improvement is very fast; however, for longer shift registers the improvement is very slow and it was impossible (because of memory and computing time limitations) to discover whether further improvement is possible for even longer shift registers.

We see, however, that the improvement over the optimal quantizer is about 0.7 dB and that the resulting performance is only 0.9 dB below the rate distortion bound. As far as we know this is the only non-random source coding scheme that has been shown to beat the optimal quantizer for this source at a rate of 1 bit/symbol.
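The two reference points in this comparison can be reproduced from closed forms, which gives a quick sanity check on the quoted gains. The sketch assumes a unit-variance memoryless Gaussian source and squared-error distortion: the rate distortion bound is $D(R) = 2^{-2R}$, and the one-bit Max quantizer has outputs $\pm\sqrt{2/\pi}$ with distortion $1 - 2/\pi$.

```python
import math


def snr_db(distortion, variance=1.0):
    """Signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(variance / distortion)


# Rate distortion bound at R = 1 bit/symbol.
d_bound = 2.0 ** (-2 * 1)

# One-bit Max quantizer for the unit-variance Gaussian.
d_max = 1.0 - 2.0 / math.pi

# Room between the optimal quantizer and the bound, in dB.
gap_db = snr_db(d_bound) - snr_db(d_max)

if __name__ == "__main__":
    print("bound     :", snr_db(d_bound), "dB")
    print("quantizer :", snr_db(d_max), "dB")
    print("gap       :", gap_db, "dB")
```

The gap comes out at roughly 1.6 dB, which matches the reported 0.7 dB gain over the quantizer plus the remaining 0.9 dB to the bound.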

It is interesting to note that if we use the "forward channel" approach [1, p. 101] to evaluate the "optimal" distribution at the output of this channel, this turns out to be Gaussian with standard deviation of 0.87. Our decoder has an asymptotically Gaussian distribution with a standard deviation of about 0.9. This rather close agreement suggests that one should try to fake the process that arises in the calculation of the rate distortion function and not the input process itself.

Autoregressive Sources

An autoregressive source was defined by equation (2.1). Note that the process $X$ is also the output of a linear filter with the transfer function

$$H(z) = \frac{1}{1 - a z^{-1}} \qquad (4.16)$$

when driven by an i.i.d. process.

This leads to the possibility of faking an autoregressive process by passing the output of the filter in Fig. 19 through the linear filter of (4.16) (see Fig. 22(a)). It can be noted that this is somewhat reminiscent of the innovations quantization approach, which is extremely inefficient for compressing autoregressive sources (for Gaussian sources the performance of such a system is the same as simply optimally quantizing the process $X$), but because of the tree search at the encoder, our system is not an innovations quantizer.

Another fake of an autoregressive source can be obtained by noting that since this is a Markov process, we can use the conditional inverse distribution $F^{-1}_{X_n | X_{n-1}}(\cdot \mid \hat x_{n-1})$ instead of the marginal in Figure 19. Of course, the conditioning is made on the reproduced past $\hat x_{n-1}$. The filter resulting from this approach is given in Fig. 22(b).

The filters in Fig. 22(a) and (b) turn out to be completely equivalent [26], i.e., if they are driven by the same input sequence they will yield the same output sequence.
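For the Gaussian case the equivalence is easy to see numerically. The sketch below is an illustration under stated assumptions, not a rendering of Fig. 22: the source is taken to be $X_n = a X_{n-1} + W_n$ with $W_n \sim N(0, \sigma_w^2)$, so the conditional distribution of $X_n$ given $x_{n-1}$ is $N(a x_{n-1}, \sigma_w^2)$, and `v_seq` stands for the scrambler outputs in (0, 1). The function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def fake_ar_filtered(v_seq, a, sigma_w=1.0):
    """Inverse distribution of the innovation, then the filter 1 / (1 - a z^-1)."""
    x_prev, out = 0.0, []
    for v in v_seq:
        w = sigma_w * STD_NORMAL.inv_cdf(v)   # pseudo-white driving sample
        x_prev = a * x_prev + w               # recursive linear filter
        out.append(x_prev)
    return out


def fake_ar_conditional(v_seq, a, sigma_w=1.0):
    """Inverse conditional distribution given the reproduced past sample."""
    x_prev, out = 0.0, []
    for v in v_seq:
        x_prev = NormalDist(mu=a * x_prev, sigma=sigma_w).inv_cdf(v)
        out.append(x_prev)
    return out


if __name__ == "__main__":
    v_seq = [0.1, 0.7, 0.5, 0.95, 0.3]
    print(fake_ar_filtered(v_seq, 0.8))
    print(fake_ar_conditional(v_seq, 0.8))
```

Both functions print the same sequence up to rounding, because the inverse of $N(a\hat x_{n-1}, \sigma_w^2)$ at $v$ is $a\hat x_{n-1} + \sigma_w \Phi^{-1}(v)$, which is exactly one step of the recursive filter.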

Another interesting result concerning the filters in Fig. 22 is that if the driving sequence $\{U\}$ is i.i.d. and uniform (on its discrete alphabet), then the output will have an exponential autocorrelation function (this is a direct application of the fact that the output of the scrambling function is a pseudo-white noise). Thus by definition the output is a wide-sense Markov process [28].

Note that since the output of the scrambling function is only uncorrelated (and not independent), the marginal distribution of $\{\hat X_n\}$ is not the same as the marginal of $\{X_n\}$ even when $K \to \infty$ ($K$ is the shift register length). A comparison between these two distributions is made in Fig. 23 for the case where $F(\cdot)$ is the Gaussian distribution, the decoder has a length 8 shift register and the autoregressive constant is 0.8 (results were obtained by simulating a sequence of length 100,000).

Even when the shift register of the filter in Fig. 22 is of finite length, the whole filter is not a finite state machine because of the feedback loop at the output. As a result, when this filter is used as a decoder in a data compression system, one has to use a tree search as the encoder (it is also possible to truncate the decoder in this case and use a Viterbi algorithm encoder; however, as we have seen, this results in poor performance in the higher range of autoregressive constants).

An incremental tree search was used in simulating data compression systems using the filter in Fig. 22 as the decoder. Tests were performed on several blocks of length 2000 and the given results are the average of five tests. Results are given in Figure 24 for search depths of 5 and 8. In both cases the system outperforms the predictive quantizer, but the gain is only about 0.3 dB for the depth 5 search while it is approximately 1 dB for the depth 8 search. In additional simulations it was shown that even the depth 5 search will yield about 1.5 dB gain (in distortion) as compared to the predictive quantizer in the case of a Wiener sequence (a = 1).

As in the i.i.d. case, the optimal scaling factor was found to be in the vicinity of 0.9.

Moving  Average  Sources 

A first-order moving average source satisfies equation (2.3). We use the filter in Fig. 25 to obtain a fake moving average process. In this case the filter is a finite state machine for a finite shift register length and therefore it is possible to use a Viterbi algorithm as the encoder (the Viterbi algorithm is usually preferred when applicable because it is optimal).

Simulation results are given in Fig. 26 for a first-order Gaussian moving average source and a length 6 decoder. The given results are the average of five tests of length 2000 each. The rate distortion function was calculated numerically using the parametric algorithm outlined in [1, p. 115]. Results are also given for delta modulation, but as can be expected the performance is even poorer than direct quantization. Our new system outperforms the optimal quantizer by as much as 2 dB. This is a considerable improvement for such a low rate and is obtained in a case where the common data compression schemes perform poorly.

Summary and Conclusions


The main purpose of this report was to motivate and describe the design of data compression systems using the fake process approach.

Results were presented for systems designed to compress various Gaussian processes, as those processes are widely used to model real life phenomena. The fake process approach is not, however, limited to this type of source (with the exception of the C.L.T. decoder, which uses characteristics unique to the Gaussian process). The results that were presented for low rate systems are very encouraging and the added complexity is very moderate. The next logical step would be to obtain results for higher rate systems, but very high rate systems are not practical as the complexity of trellis and tree searches grows exponentially fast with the bit rate. Also we would like to obtain results for "real life" compression schemes, such as image or speech compression using the fake process approach.

For example, to build a speech compression system "all" that is needed is a (possibly nonlinear) filter which when driven by a discrete i.i.d. source will produce a process that "sounds like" speech. There is a twofold problem with this approach: first, an appropriate distortion measure $\rho$ must be established such that the $\bar\rho$-distance is a good measure for the "closeness" of different sounds; and second, a filter must be designed such that its output, when driven by a discrete i.i.d. source, is close in the $\bar\rho$ sense to human speech. If this approach can be realized, it can become an alternative to the current LPC speech compression systems, which also require a considerable computational effort.

Appendix A


Properties of C.L.T. Decoder

(1) Obviously the value of the scaling factor $b$ does not affect the entropy (as long as $b \ne 0$).

Let $S_n = (u_n, u_{n-1}, \ldots, u_{n-K+1})$ denote the state of the shift register (i.e., the values of the shift register cells). Note that if $S_n$ is given, then $S_{n+1}$ can take on $M$ distinct values depending on the next value $u_{n+1}$. Since $\hat X_n$ depends only on $S_n$, and given $S_n$, $\hat X_{n+1}$ can take on $M$ distinct equally probable values, we have

$$H(\hat X_n \mid S_{n-1}) = \log_2 M = H(U)$$

Thus

$$H(\hat X) = \lim_{n \to \infty} H(\hat X_n \mid \hat X_{n-1}, \hat X_{n-2}, \ldots) \ge \lim_{n \to \infty} H(\hat X_n \mid \hat X_{n-1}, \hat X_{n-2}, \ldots, S_{n-1}, S_{n-2}, \ldots)$$

$$= \lim_{n \to \infty} H(\hat X_n \mid S_{n-1}, S_{n-2}, \ldots) = \lim_{n \to \infty} H(\hat X_n \mid S_{n-1}) = H(U)$$

So we have $H(\hat X) \ge H(U)$. Since we have the reverse inequality, Property (1) is proved.


(2) Since the $u_n$'s inside the shift register are zero mean and independent, the variance of the sum is the scaled sum of variances, so:

$$\sigma_{\hat X}^2 = R_{\hat X}(0) = b^2 \sigma_u^2 K$$

Also

$$\hat x_n = b \sum_{i=0}^{K-1} u_{n-i}, \qquad \hat x_{n+t} = b \sum_{i=0}^{K-1} u_{n+t-i}$$

Assume $t > 0$ (we obtain the autocorrelation function for $t < 0$ by using symmetry):

$$R_{\hat X}(t) = E \hat X_n \hat X_{n+t} = b^2 E\left( \sum_{i=0}^{K-1} u_{n-i} \sum_{i=0}^{K-1} u_{n+t-i} \right)$$

Since the $u_i$'s are i.i.d. and zero mean, all the terms in the product vanish except those of the form $u_i^2$. If $t \ge K$ there are no such terms and $R_{\hat X}(t) = 0$. If $t < K$:

$$b^2 E\left( \sum_{i=0}^{K-1} u_{n-i} \sum_{i=0}^{K-1} u_{n+t-i} \right) = b^2 E\left( \sum_{i=0}^{K-1} u_{n-i} \sum_{i=0}^{K-t-1} u_{n-i} \right) = b^2 (K - t)\, \sigma_u^2 = b^2 \sigma_u^2 K\, \frac{K - t}{K}$$

and this is the desired result.

Appendix B


Properties of Inverse Distribution Decoders

(1) Let $V_n = \sum_{i=1}^{K} 2^{-i} u_{n-i+1}$. By definition,

$$\hat X_n = F_X^{-1}\left( V_n + 2^{-(K+1)} \right) \qquad (B.1)$$

Define

$$\langle V_n/2 \rangle_K = \begin{cases} V_n/2 & \text{if } 2^K V_n \text{ is even} \\ V_n/2 - 2^{-(K+1)} & \text{if } 2^K V_n \text{ is odd} \end{cases} \qquad (B.2)$$

If we have a length $K$ shift register containing $(u_n, u_{n-1}, \ldots, u_{n-K+1})$ then $V_n$ is a number whose binary expansion (after the binary point) is given by $(u_n, u_{n-1}, \ldots, u_{n-K+1})$ and $\langle V_n/2 \rangle_K$ is a number whose binary expansion is given by shifting a zero into the shift register. Note that there is a one-to-one correspondence between $V_n$ and $(u_n, u_{n-1}, \ldots, u_{n-K+1})$. Using the definition (B.2) we obtain

$$V_n = \langle V_{n-1}/2 \rangle_K + \frac{u_n}{2}$$

Since $\{u_n\}$ is i.i.d. it follows that $\{V_n\}$ is a first order Markov chain. The function $F_X^{-1}(\cdot)$ is one-to-one, thus $\hat X_n$, which is obtained by (B.1), is also first order Markov since

$$\hat X_n = F_X^{-1}\left( V_n + 2^{-(K+1)} \right) = F_X^{-1}\left( \langle V_{n-1}/2 \rangle_K + \frac{u_n}{2} + 2^{-(K+1)} \right) \qquad (B.3)$$

So $\hat X_n$ depends only on $\hat X_{n-1}$ and the next input $u_n$ to the filter. The conclusion remains valid as $K \to \infty$.
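(2) Since $\hat{X}_n$ is first order Markov then [23, Section 6.4],

\[ H(\hat{X}) = H(\hat{X}_n \mid \hat{X}_{n-1}). \tag{B.4} \]

If $\hat{X}_{n-1}$ is given, $\hat{X}_n$ can take on only two distinct values

\[
\hat{X}_n =
\begin{cases}
F_X^{-1}\left(\langle V_{n-1}/2 \rangle_K + 2^{-(K+1)}\right) & \text{if } u_n = 0 \\
F_X^{-1}\left(\langle V_{n-1}/2 \rangle_K + \tfrac{1}{2} + 2^{-(K+1)}\right) & \text{if } u_n = 1.
\end{cases}
\]

The probability distribution on those values will be the same (given $\hat{X}_{n-1}$) as the probability distribution of $u_n$ on $\{0,1\}$, so

\[ H(\hat{X}_n \mid \hat{x}_{n-1}) = H(u_n) = H(U) \quad \text{as } \{U_n\} \text{ is i.i.d.} \]

Since $H(\hat{X}_n \mid \hat{x}_{n-1})$ is independent of $\hat{x}_{n-1}$, using (5.15) we have

\[ H(\hat{X}) = H(U) \sum_{\hat{x}_{n-1}} p(\hat{X}_{n-1} = \hat{x}_{n-1}) = H(U). \]

Since we always assume $P\{u_n = 0\} = P\{u_n = 1\} = 1/2$, we have

\[ H(\hat{X}) = 1 \text{ bit/symbol.} \]

The conclusion remains valid as $L \to \infty$.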

Note that in the case where $K = \infty$, $V_n = \sum_{i=1}^{\infty} 2^{-i} u_{n-i+1}$ is a continuous random variable uniformly distributed on $[0,1]$. Since $V_n$ is continuous, $\hat{X}_n = F_X^{-1}(V_n)$ is continuous; thus we have an example of a continuous amplitude process with an arbitrary distribution having a finite entropy.
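The entropy argument of item (2) can be checked numerically for a small register. The sketch below assumes a unit Gaussian source and equiprobable input bits; the function names are hypothetical. For every register state it lists the two possible next outputs and averages the resulting conditional entropy over the states.

```python
from math import log2
from statistics import NormalDist

K = 4                              # shift-register length (illustrative choice)
inv_cdf = NormalDist().inv_cdf     # F_X^{-1} for an assumed unit Gaussian source


def successors(j, k=K):
    """Two possible next outputs when the register holds V = j / 2^k."""
    half = (j >> 1) / 2 ** k       # <V/2>_K: shift in a zero, drop the oldest bit
    return [inv_cdf(half + u / 2 + 2.0 ** -(k + 1)) for u in (0, 1)]


def conditional_entropy(k=K):
    """H(X_n | X_{n-1}) in bits, averaging over equiprobable register states."""
    total = 0.0
    for j in range(2 ** k):
        probs = {}
        for x in successors(j, k):             # each input bit has probability 1/2
            probs[x] = probs.get(x, 0.0) + 0.5
        total += -sum(p * log2(p) for p in probs.values()) / 2 ** k
    return total
```

Because $F_X^{-1}$ is strictly increasing, the two successors of every state are distinct, so `conditional_entropy()` evaluates to 1 bit, in agreement with the result above.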
(3) Let $S_K = \left\{ \frac{i}{2^K} + \frac{1}{2^{K+1}},\ i = 0, 1, \ldots, 2^K - 1 \right\}$. First note that if $\alpha$ is uniform on $S_K$ then $\Pr(\alpha = \alpha_0) = 2^{-K}$ (since there are $2^K$ members in $S_K$). Since $F_X$ is monotonic and continuous it is also one-to-one, hence $F_X^{-1}(\cdot)$ is one-to-one. Thus $F_X^{-1}(\alpha)$ takes on $2^K$ distinct values with equal probability. If we choose any K bit equal area quantizer of X then it also will have $2^K$ distinct outputs with equal probability.


The decision levels for the equal area quantizer are determined in the following way. Assume the decision levels are $-\infty = a_0, a_1, a_2, \ldots, a_N = \infty$, where for a K-bit quantizer $N = 2^K$ and we have

\[ F_X(a_k) = \frac{k}{2^K}; \quad k = 0, 1, 2, \ldots, 2^K. \tag{B.5} \]

(This is the equal area condition.)

Suppose that the quantizer outputs are $\xi_0, \xi_1, \ldots$, where $\xi_i$ is output if the input is in the interval $(a_i, a_{i+1})$. The expected absolute error is given by

\[ e = \sum_{i=0}^{N-1} \int_{a_i}^{a_{i+1}} |t - \xi_i| \, dF_X(t). \tag{B.6} \]

To minimize (B.6) we have to minimize each term. Rewriting the $i$th term in the sum and taking derivatives with respect to the $\xi_i$ yields

\[ \int_{a_i}^{\xi_i} dF_X(t) + \int_{a_{i+1}}^{\xi_i} dF_X(t) = 0, \]

that is,

\[ F_X(a_i) - 2F_X(\xi_i) + F_X(a_{i+1}) = 0. \]

Using (B.5) we obtain

\[ F_X(\xi_i) = \frac{F_X(a_i) + F_X(a_{i+1})}{2} = \frac{i}{2^K} + \frac{1}{2^{K+1}}, \]

thus

\[ \xi_i = F_X^{-1}\left( \frac{i}{2^K} + \frac{1}{2^{K+1}} \right), \tag{B.7} \]

which is the desired relation. In order to complete the proof we take the second order derivative of the $i$th term, writing $f_X$ for the density of $F_X$:

\[ \frac{\partial^2 e}{\partial \xi_i^2} = 2 f_X(\xi_i) > 0. \]

So indeed the $\xi_i$ that we found by (B.7) are the ones that minimize the absolute error.
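As a concrete illustration of (B.5) and (B.7), the following sketch builds a K-bit equal area quantizer. It assumes a unit Gaussian source by default; `equal_area_quantizer` is a hypothetical helper name, not one taken from the report.

```python
from statistics import NormalDist


def equal_area_quantizer(k, dist=NormalDist()):
    """Return (decision levels a_0..a_N, outputs xi_0..xi_{N-1}) for N = 2^k cells."""
    n = 2 ** k
    # (B.5): F_X(a_i) = i / 2^K, with a_0 = -inf and a_N = +inf
    levels = ([float("-inf")]
              + [dist.inv_cdf(i / n) for i in range(1, n)]
              + [float("inf")])
    # (B.7): xi_i = F_X^{-1}(i / 2^K + 1 / 2^(K+1)), the conditional median of cell i
    outputs = [dist.inv_cdf(i / n + 1 / (2 * n)) for i in range(n)]
    return levels, outputs
```

Each output is the median of its cell, which is the value that minimizes the expected absolute error within that cell. For K = 1 and a unit Gaussian the two outputs are approximately -0.674 and +0.674.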


(4) As before let

\[ V_n = \sum_{i=1}^{\infty} 2^{-i} u_{n-i+1} \]

and

\[ \hat{X}_n = F_X^{-1}(V_n). \]

By definition (assume $\tau > 0$)

\[ R_{\hat{X}}(\tau) = E\left[ F_X^{-1}(V_n) \, F_X^{-1}(V_{n+\tau}) \right]. \tag{B.8} \]

$V_n$ is uniformly distributed on $[0,1]$. Also

\[ V_{n+\tau} = \frac{V_n}{2^\tau} + \frac{j}{2^\tau}, \]

where $j$ is uniformly distributed on $0, 1, \ldots, 2^\tau - 1$. (This follows from the fact that $u_n$ is i.i.d. and uniform on $\{0,1\}$.) Since $j$ depends only on $u_{n+\tau}, u_{n+\tau-1}, \ldots, u_{n+1}$, $j$ is independent of $V_n$. Thus (B.8) becomes
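\[
R_{\hat{X}}(\tau) = E\left[ F_X^{-1}(V_n) \, F_X^{-1}\left( \frac{V_n}{2^\tau} + \frac{j}{2^\tau} \right) \right]
= \int_0^1 F_X^{-1}(t) \, \frac{1}{2^\tau} \sum_{j=0}^{2^\tau - 1} F_X^{-1}\left( \frac{t}{2^\tau} + \frac{j}{2^\tau} \right) dt.
\tag{B.9}
\]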


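Assume, for the moment, that $\int_0^1 F_X^{-1}(t)\,dt$ exists and is finite (we prove this later on); then since $F_X^{-1}(\cdot)$ is monotonic non-decreasing

\[
\frac{1}{2^\tau} \sum_{j=0}^{2^\tau - 1} F_X^{-1}\left( \frac{t}{2^\tau} + \frac{j}{2^\tau} \right)
\le \frac{1}{2^\tau} \sum_{j=0}^{2^\tau - 1} F_X^{-1}\left( \frac{j+1}{2^\tau} \right)
= \frac{1}{2^\tau} \sum_{j=1}^{2^\tau} F_X^{-1}\left( \frac{j}{2^\tau} \right)
\to \int_0^1 F_X^{-1}(z)\,dz
\tag{B.10}
\]

as $\tau \to \infty$. This is by the Riemann sum approximation to the integral.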


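Also, for $t > 0$ (let $t \ge \varepsilon > 0$),

\[
\frac{1}{2^\tau} \sum_{j=0}^{2^\tau - 1} F_X^{-1}\left( \frac{t}{2^\tau} + \frac{j}{2^\tau} \right)
\ge \frac{1}{2^\tau} \sum_{j=0}^{2^\tau - 1} F_X^{-1}\left( \frac{\varepsilon}{2^\tau} + \frac{j}{2^\tau} \right)
\to \int_0^1 F_X^{-1}(z)\,dz
\tag{B.11}
\]

as $\tau \to \infty$; the bound holds for every $\varepsilon > 0$, hence as $\varepsilon \to 0$.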


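Equations (B.10) and (B.11) show that

\[
\lim_{\tau \to \infty} \frac{1}{2^\tau} \sum_{j=0}^{2^\tau - 1} F_X^{-1}\left( \frac{t}{2^\tau} + \frac{j}{2^\tau} \right) = \int_0^1 F_X^{-1}(z)\,dz, \quad \text{all } t \in [0,1].
\]

By assumption $\int_0^1 F_X^{-1}(z)\,dz$ is finite. Hence by using the dominated convergence theorem:

\[
\lim_{\tau \to \infty} R_{\hat{X}}(\tau)
= \int_0^1 F_X^{-1}(t) \left( \lim_{\tau \to \infty} \frac{1}{2^\tau} \sum_{j=0}^{2^\tau - 1} F_X^{-1}\left( \frac{t}{2^\tau} + \frac{j}{2^\tau} \right) \right) dt
= \int_0^1 F_X^{-1}(t)\,dt \int_0^1 F_X^{-1}(z)\,dz
= \left[ \int_0^1 F_X^{-1}(t)\,dt \right]^2.
\]

Setting $t = F_X(z)$ yields

\[ \int_0^1 F_X^{-1}(t)\,dt = \int_{-\infty}^{\infty} z \, dF_X(z) = E X, \]

so the integral is finite whenever the source has a finite mean.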
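The limit used in item (4) can be observed numerically. The sketch below is only an illustration under the assumption of a zero-mean unit Gaussian source, for which $\int_0^1 F_X^{-1}(z)\,dz = E X = 0$; `inner_average` is a hypothetical name for the inner sum of (B.9).

```python
from statistics import NormalDist

inv_cdf = NormalDist().inv_cdf     # F_X^{-1} for an assumed unit Gaussian source


def inner_average(t, tau):
    """(1 / 2^tau) * sum_j F^{-1}((t + j) / 2^tau) for 0 < t < 1."""
    n = 2 ** tau
    return sum(inv_cdf((t + j) / n) for j in range(n)) / n


if __name__ == "__main__":
    for tau in (2, 4, 8, 12):
        # The printed values shrink toward E X = 0 as tau grows.
        print(tau, inner_average(0.25, tau))
```

For t = 1/2 the terms cancel in symmetric pairs, so the average is essentially zero for every tau; for other t the average approaches zero as tau increases, which is the behaviour the dominated convergence step relies on.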
Appendix C


Proof of Theorem 4.1:

Let K denote the shift register length. The process $\{\hat{X}_n\}$ is stationary since it is a sliding block coding of an i.i.d. process. Thus by definition the autocorrelation function is given by:

\[ R_{\hat{X}}(i) = E\, \hat{X}(0) \hat{X}(i). \tag{C.1} \]

Let $S_n$ denote the state of the shift register given by

\[ S_n = (U_n, U_{n-1}, \ldots, U_{n-K+1}); \tag{C.2} \]

equivalently the state can be specified by a single integer given by

\[ J_n = \sum_{m=1}^{K} U_{n-m+1} \, 2^{K-m}. \tag{C.3} \]


First consider the case $i < K$.

Let $[a]$ denote the integer part of $a$; then we have the following relation between any two states:

\[ J_{n+j} = \left[ J_n / 2^j \right] + \sum_{m=1}^{j} U_{n+j-m+1} \, 2^{K-m} \tag{C.4} \]

for $1 \le j < K$.

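Using the fact that $\{U_n\}$ is an i.i.d. Bernoulli sequence with $p(U_n = 0) = p(U_n = 1) = \tfrac{1}{2}$, we can rewrite (C.1) in terms of the states $\{J_n\}$. With $g$ denoting the scrambling function, $p$ the value of $J_0$, and $m$ the integer formed by the $i$ newest input bits in (C.4),

\[
R_{\hat{X}}(i) = \frac{1}{2^{K+i}} \sum_{p=0}^{2^K - 1} F_X^{-1}\left( g\left( \frac{p}{2^K} + \frac{1}{2^{K+1}} \right) \right) \sum_{m=0}^{2^i - 1} F_X^{-1}\left( g\left( \frac{[p/2^i]}{2^K} + \frac{m}{2^i} + \frac{1}{2^{K+1}} \right) \right).
\tag{C.5}
\]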
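To show that $R_{\hat{X}}(i) = 0$ it will suffice to show that

\[ \sum_{m=0}^{2^i - 1} F_X^{-1}\left( g\left( \frac{[p/2^i]}{2^K} + \frac{m}{2^i} + \frac{1}{2^{K+1}} \right) \right) = 0. \tag{C.6} \]

Break the sum in (C.6) into two sums:

\[
\sum_{m=0}^{2^i - 1} F_X^{-1}\left( g\left( \frac{[p/2^i]}{2^K} + \frac{m}{2^i} + \frac{1}{2^{K+1}} \right) \right)
= \sum_{m=0}^{2^{i-1} - 1} F_X^{-1}\left( g\left( \frac{[p/2^i]}{2^K} + \frac{m}{2^i} + \frac{1}{2^{K+1}} \right) \right)
+ \sum_{m=2^{i-1}}^{2^i - 1} F_X^{-1}\left( g\left( \frac{[p/2^i]}{2^K} + \frac{m}{2^i} + \frac{1}{2^{K+1}} \right) \right).
\tag{C.7}
\]

Changing index in the second sum in (C.7) we obtain:

\[
\sum_{m=2^{i-1}}^{2^i - 1} F_X^{-1}\left( g\left( \frac{[p/2^i]}{2^K} + \frac{m}{2^i} + \frac{1}{2^{K+1}} \right) \right)
= \sum_{m=0}^{2^{i-1} - 1} F_X^{-1}\left( g\left( \frac{[p/2^i]}{2^K} + \frac{m}{2^i} + \frac{1}{2^{K+1}} + \frac{1}{2} \right) \right)
\tag{C.8}
\]

and substituting (C.8) into (C.7) yields:

\[
\sum_{m=0}^{2^i - 1} F_X^{-1}\left( g\left( \frac{[p/2^i]}{2^K} + \frac{m}{2^i} + \frac{1}{2^{K+1}} \right) \right)
= \sum_{m=0}^{2^{i-1} - 1} \left[ F_X^{-1}\left( g\left( \frac{[p/2^i]}{2^K} + \frac{m}{2^i} + \frac{1}{2^{K+1}} \right) \right)
+ F_X^{-1}\left( g\left( \frac{[p/2^i]}{2^K} + \frac{m}{2^i} + \frac{1}{2^{K+1}} + \frac{1}{2} \right) \right) \right].
\tag{C.9}
\]

But we assumed $p \le 2^K - 1$ and also we have $m \le 2^{i-1} - 1$. Thus

\[ \frac{[p/2^i]}{2^K} + \frac{m}{2^i} + \frac{1}{2^{K+1}} \le \frac{1}{2} - \frac{1}{2^{K+1}} < \frac{1}{2}. \tag{C.10} \]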


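Thus, using the assumed property of the scrambling function we have:

\[
g\left( \frac{[p/2^i]}{2^K} + \frac{m}{2^i} + \frac{1}{2^{K+1}} + \frac{1}{2} \right)
= 1 - g\left( \frac{[p/2^i]}{2^K} + \frac{m}{2^i} + \frac{1}{2^{K+1}} \right).
\tag{C.11}
\]

But the distribution $F_X$ was assumed to be continuous and antisymmetric around $x = 0$. This implies

\[ F_X^{-1}(t) = -F_X^{-1}(1 - t), \quad 0 \le t \le 1. \tag{C.12} \]

Thus using (C.11) we have:

\[
F_X^{-1}\left( g\left( \frac{[p/2^i]}{2^K} + \frac{m}{2^i} + \frac{1}{2^{K+1}} + \frac{1}{2} \right) \right)
= -F_X^{-1}\left( g\left( \frac{[p/2^i]}{2^K} + \frac{m}{2^i} + \frac{1}{2^{K+1}} \right) \right).
\tag{C.13}
\]

Substituting (C.13) into (C.9) we see that all the terms in the sum are zero, implying the sum in (C.6) is zero.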

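In the case when $i \ge K$ the states $J_0$ and $J_i$ are independent, which implies that $\hat{X}_0$ and $\hat{X}_i$ are independent random variables; thus

\[ R_{\hat{X}}(i) = \left( E\, \hat{X}(0) \right)^2, \quad i \ge K, \tag{C.14} \]

but

\[ E\, \hat{X}(0) = \frac{1}{2^K} \sum_{m=0}^{2^K - 1} F_X^{-1}\left( g\left( \frac{m}{2^K} + \frac{1}{2^{K+1}} \right) \right). \tag{C.15} \]

The sum in (C.15) can be shown to be zero by the same procedure that the sum in (C.6) was shown to be zero.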
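The cancellation argument of this appendix can be checked by brute force for a short register. The sketch below is an illustration under explicit assumptions: the source is a unit Gaussian (symmetric about zero, as (C.12) requires), and `g` is one hypothetical scrambling function chosen only because it satisfies g(a + 1/2) = 1 - g(a); it is not claimed to be the scrambling function used in the report.

```python
from itertools import product
from statistics import NormalDist

inv_cdf = NormalDist().inv_cdf     # F_X^{-1} for an assumed unit Gaussian source


def g(a):
    """Hypothetical scrambler with g(a + 1/2) = 1 - g(a) for a < 1/2."""
    return a if a < 0.5 else 1.5 - a


def output(reg, scramble=g):
    """reg = (u_n, ..., u_{n-K+1}); X_n = F^{-1}(g(J_n / 2^K + 2^-(K+1)))."""
    k = len(reg)
    j = sum(u << (k - 1 - m) for m, u in enumerate(reg))   # (C.3): u_n is the top bit
    return inv_cdf(scramble(j / 2 ** k + 2.0 ** -(k + 1)))


def autocorrelation(k, lag, scramble=g):
    """Exact E[X_0 X_lag], enumerating all 2^(k+lag) equiprobable input strings."""
    total = 0.0
    for bits in product((0, 1), repeat=k + lag):   # bits[0] is the newest input u_lag
        x_lag = output(bits[:k], scramble)          # register at time lag
        x_0 = output(bits[lag:lag + k], scramble)   # register at time 0
        total += x_0 * x_lag
    return total / 2 ** (k + lag)
```

With this `g`, `autocorrelation(4, lag)` vanishes (up to rounding) for every lag of 1 or more, while passing the identity map as `scramble` leaves a clearly positive correlation at lag 1. That contrast is the whitening effect the theorem describes.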
Appendix D


Proof of Lemmas 4.1 and 4.2


Lemma 4.1


We always have

\[ H(\hat{X}) \le H(U) = 1. \]

Define $J_n$ as in (C.3). Since $\{\hat{X}_n\}$ is stationary:

\[ H(\hat{X}) \ge H(\hat{X}_n \mid J_{n-1}). \]

Once $J_{n-1}$ is given, $\hat{X}_n$ can take on at most two distinct values:

\[ \hat{X}_n = F_X^{-1}\left( g\left( \frac{[J_{n-1}/2]}{2^K} + \frac{u_n}{2} + 2^{-(K+1)} \right) \right). \]

Since $F_X^{-1}$ is one-to-one, $\hat{X}_n$ will take on two distinct values with equal probability for any given $J_{n-1}$ if (4.14) holds. So $H(\hat{X}_n \mid J_{n-1}) = 1$ if (4.14) holds. QED.


Note that since (4.12) requires

\[ g(a + 1/2) + g(a) = 1, \quad a \le 1/2, \]

in order to satisfy (4.14) we only have to require $g(a) \ne 1/2$ for


Lemma 4.2


The proof is straightforward. The input to the scrambling function is uniformly distributed on A. Suppose (4.15) holds; then let $x, y \in g(A)$, with Z denoting the output and V the input to the scrambling function.

\[
\Pr\{Z = x\} = \Pr\{g(V) = x\} = \Pr\{V \in g^{-1}(x)\} = \frac{\|g^{-1}(x)\|}{2^L}
= \frac{\|g^{-1}(y)\|}{2^L} = \Pr\{V \in g^{-1}(y)\} = \Pr\{Z = y\} \quad \forall x, y \in g(A).
\]

On the other hand assume that the output is uniformly distributed on the finite set $g(A)$. This implies for any $x, y \in g(A)$, $\Pr\{Z = x\} = \Pr\{Z = y\}$, so $\Pr\{V \in g^{-1}(y)\} = \Pr\{V \in g^{-1}(x)\}$. Since V is uniform on a discrete set this implies

\[ \|g^{-1}(x)\| = \|g^{-1}(y)\|. \]

QED.
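The equivalence in Lemma 4.2 is easy to see on a toy example. The sketch below uses exact rational arithmetic; the two maps passed in the usage note are hypothetical stand-ins for a scrambling function, chosen only to contrast equal and unequal preimage sizes.

```python
from collections import Counter
from fractions import Fraction


def output_distribution(g, inputs):
    """Distribution of Z = g(V) when V is uniform on the finite collection `inputs`."""
    counts = Counter(g(v) for v in inputs)       # counts[z] = size of the preimage of z
    return {z: Fraction(c, len(inputs)) for z, c in counts.items()}


def is_uniform(dist):
    """True when every output value has the same probability."""
    return len(set(dist.values())) == 1
```

With V uniform on `range(8)`, the map `lambda v: v % 4` has preimages of size two for every output and gives a uniform Z, whereas `lambda v: min(v, 3)` has preimage sizes 1, 1, 1 and 5 and does not.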


Fig. 3 A TRELLIS MATCHED TO A LENGTH 3 DECODER

Fig. 6 PREDICTIVE QUANTIZATION SYSTEM

Fig. 7 LINEAR PREDICTIVE QUANTIZER

Fig. 11 LOOK AHEAD PREDICTIVE QUANTIZATION

Fig. 15 LOOK AHEAD DELTA MODULATION

Fig. 19 SCRAMBLING FUNCTION SCRAMBLER

PERFORMANCE: INVERSE DISTRIBUTION/SCRAMBLING FUNCTION DECODER (axes: DISTORTION and SNR versus DECODER LENGTH)

(a) AUTOREGRESSIVE FILTER

Fig. 23 DISTRIBUTION FUNCTIONS OF REAL GAUSSIAN AND FAKE PROCESSES

PERFORMANCE: FAKE AUTOREGRESSIVE DECODER (SOURCE: FIRST ORDER GAUSS-MARKOV; RATE: 1 BIT/SYMBOL; L = SEARCH DEPTH; DECODER LENGTH: 5; 15 PERIOD SCRAMBLING FUNCTION; ENCODER: INCREMENTAL TREE SEARCH; horizontal axis: AUTOREGRESSIVE CONSTANT)

Fig. 26 PERFORMANCE: FAKE MOVING AVERAGE PROCESS (SOURCE: FIRST-ORDER MOVING AVERAGE; RATE: 1 BIT/SYMBOL; SHIFT REGISTER LENGTH: 6; 31 PERIODS SCRAMBLING FUNCTION; ENCODER: VITERBI ALGORITHM; curves: OPTIMAL QUANTIZATION, DELTA MODULATION; horizontal axis: MOVING AVERAGE CONSTANT)